PREFACE TO THE SECOND EDITION


The most important changes that have been made in the second 
edition of this book are as follows. 


1. The pertinent results and implications of approximately one 
hundred and eighty research studies and publications of recent 
date have been incorporated into the text, and the material has 
been documented in the bibliographies. 

2. Twelve recent tests of major importance have been described
and evaluated, including such instruments as the Kuder Preference 
Record, the Minnesota Multiphasic Personality Inventory, the 
Bennett Test of Mechanical Comprehension, and the SRA Primary 
Mental Abilities Test for Children. 


3. The chapters on intelligence testing have been radically
reorganized to approach as closely as possible a chronological
presentation of the topics treated. In a book of this type a strict
chronological order does not seem desirable, as it would separate 
closely related material, as, for example, the Terman-McNemar 
Test from the Terman Group Test. 

In addition, quite a number of topics have been treated with a 
somewhat different emphasis and somewhat more fully. These
include reliability, validity, factor analysis, the implications of 
testing during World War II, and so forth. The indexing has been 
revised and amplified, with the aim of making it more serviceable, 
and the same applies to all documentation. 

One major feature of the book has, after careful consideration, 
been retained. This is the treatment of general topics having to do 
with the logic of measurement, such as validity, reliability, and 
types of scores, in an early chapter, and the return to many of 
these topics towards the close of the book. There is an obvious 
argument for grouping all such material at one point. The reason 
for not doing this is the belief that the student may well be in a 
better position to grasp the broader significance of the issues
involved after he has dealt with a wide selection of specific
applications, while at the same time he can hardly be expected to study
specific measuring instruments intelligently without a general
orientation.

Once again the position taken in the entire treatment is that
psychological tests are essentially practical tools and they must 
be understood and evaluated as such. This point of view clearly 
has a bearing on all the fundamental issues of the theory of 
measurement, and the attempt has been made to show what its 
implications are in this respect. 


JAMES L. MURSELL


PREFACE TO THE FIRST EDITION 


My purpose in this book is to present a comprehensive and 
balanced account of the testing movement in psychology, taking 
into consideration its past development, its present status, and 
its future prospects. This has determined the method of treat- 
ment throughout, the selection of topics for consideration, and the 
relative emphasis upon them. 

In my opinion psychological tests must frankly be regarded as 
practical instruments, and used, evaluated, and interpreted as 
such. Indeed, I believe that its practical orientation has from the 
first kept the testing movement in the path of sanity and realism. 
This, of course, does not mean that it is unrelated to such broad 
psychological problems as the nature of mental growth, the
relative effect of hereditary and environmental influences, the nature
of mental organization, or the issue between associationist and 
configurationist views. Moreover, the student of psychological test- 
ing must be aware of the relevance of these problems if he is to 
select and use instruments of measurement wisely and to interpret 
their results in an enlightened fashion. As a worker in the field 
of psychology the student may very properly have decided views 
on all such matters, but the testing movement as such does not
prejudge them and has no final answers. I cannot, for instance, 
see that it presupposes either a hereditarian or a “mechanistic” 
position. Many criticisms of mental testing apply, not to the sub- 
ject itself, but to the views of persons prominently associated with 
it. Such criticisms may or may not be well taken, but the issue 
needs to become clear if much confusing and unintelligent par- 
tisanship is to be avoided. 

If this prevailing point of view, together with the general pur- 
pose of the book, is kept in mind, the choice of topics and their 
relative emphasis and subordination will become clear, and will, 
I hope, appear reasonable. There is, for instance, at present a very 
widespread and intensive interest in factor analysis, yet I have 
not gone into it in great detail, though I would not be thought to 
deprecate its importance and ultimate promise. The reasons are 
first that its psychometric and psychological bearings do not yet 


viii PSYCHOLOGICAL TESTING 


seem to me to have clarified themselves, and second that the gen- 
eral student needs an intelligent comprehension of what is being 
undertaken rather than a detailed account of findings that are 
often conflicting, and of intricate controversies so far rather re- 
mote from practical issues. In the same way, an immense amount 
of work is going on in projective testing, which is far too impor- 
tant to be overlooked. But this again seems to me to be a field for 
special study, so that what the general student needs is a broad 
orientation rather than a detailed familiarity with the enormous 
mass of interpretative data now available. 

In choosing specific tests for analysis and discussion, I have in 
the main tried to select instruments which are first-rate examples 
of their type, although there are a few negative instances of tests 
open to very serious criticism. In presenting numerous synoptic 
outlines, my purpose has been to enable the reader to get a fairly 
adequate concrete idea of the tests from the book itself, although 
in the footnotes, bibliographies, and indexes I have tried to pro- 
vide him with facilities by which he can readily expand his ac- 
quaintance with them, and with the literature pertaining to them. 
I have thought it best to abandon the classification of test types 
into the numerous small subdivisions often found in favor of a few 
larger ones. In dealing with intelligence tests, the broad chrono- 
logical perspective on their development, which is now becoming 
possible, has seemed to me to be sufficiently illuminating to
deserve emphasis. I have tried to take proper cognizance of the most
recent work in the field, notably that having to do with the effect 
of preschool environment, foster-home environment, socioeconomic 
factors, and the effect of advanced age upon mentality and test 
performance. Notice also has been taken of the work done in 
mental testing during World War II. The purpose, however, has
been to treat these subjects, not so much for their own interest 
as for their bearing upon psychometric theory and practice. 

I have definitely decided against including any treatment of 
elementary statistical practice. The reason is that I do not believe 
this properly belongs in a general book on psychological testing.
If it is included at all, the treatment is bound to be scanty, inade- 
quate, and seriously misleading. It seems to me that any serious 
student of mental testing should be told frankly that he ought to
be willing to spend the time and effort—and the expenditure need 
not be exorbitant—to understand the statistical concepts and 
techniques involved, either as a collateral part of his study, or as
a prerequisite to it. There is only one way to grasp the significance
of measures of central tendency, or dispersion, or relationship, or 
of the probability distribution, and that is to work with and 
manipulate them for oneself. To speak candidly, I believe that
the extremely superficial treatments of these subjects not seldom 
found are likely to do more harm than good, because they produce 
the illusion of understanding without the reality. I make no apolo- 
gies for presenting the subject of psychological testing on the 
assumption that it is a serious one, deserving a serious approach. 


J.L.M. 


CHAPTER I


THE GENERAL CHARACTERISTICS OF 
MENTAL TESTS 


WHAT IS A PSYCHOLOGICAL TEST?


Alfred Binet, who may be considered the originator of modern 
psychological measurement, in answer to his critics, undertook a 
piece of informal investigation which well reveals the nature of 
all mental tests. He invited to his laboratory three teachers who 
were to judge the intelligence of children unknown to them, each 
in any way he pleased. It turned out that each of these teachers 
used substantially the same method. One of them asked the
children the purposes of canals and sluices. Another showed the 
children some pictures, and requested interpretations and
comments. Another asked about the details of the then recent death
of King Edward VII. The names of neighborhood streets, the 
proper road to take in order to reach a designated place, whether 
factory walls should be made thick or thin were other typical 
inquiries presented (v. Binet, pp. 182 ff.; Terman, 1916, pp. [illegible]).*
Such in essence was the testing method developed by Binet
and followed by his successors. The teachers utilized the method 
crudely. The questions were special and often had a very local 
and limited reference. They were asked in different ways, so that 
their difficulty varied even when they dealt with the same topics. 
There was no set standard for evaluating and interpreting the 
answers which the children made. As Binet put it, “The teachers 
employed very awkwardly a very excellent method.” But it was 
the only method they could find to use. A properly constructed 
and administered mental test refines, standardizes, and elaborates 
what these teachers did in their attempt to reach an appraisal of 
the mentality of the children they were examining. 

* These and similar notations refer to the bibliography at the close of the book.

A psychological test, then, is a pattern of stimuli selected and
organized to elicit responses which will reveal certain psychological
characteristics in the person who makes them. The psychological
characteristics in question may be general intelligence,
numerical ability, musical talent, mechanical aptitude, aptitude
for some specific vocation, a certain phase or type of interest, a 
certain type of emotional or personal set such as introversion or 
submissiveness, and so on. The stimuli may be pairs of words to 
be marked as having the same or different meanings, or a colored 
design to be copied by using varicolored blocks, or sentences each 
with some word or words omitted to be filled in so as to make 
sense, or lists of occupations and activities to be marked as liked 
or disliked, and so on. These are all typical examples from actual 
use. 
There are one or two points of terminology which it is well to
be clear about from the outset. The separate stimulus items—the
word pairs, the blocks and the design, the various incomplete
sentences, the various occupations or activities listed—are usually
called test items. They are the ultimate constituents out of which 
the test is built. Again, a great many published tests are divided 
into subtests, which usually consist of the same kind of items. 
Thus Army Group Intelligence Examination Alpha, developed 
for the United States Army during World War I, comprises 
ten subtests. It contains a set of brief arithmetical problems, a 
set of questions involving problems of practical judgment or 
common sense, a set of word pairs to be marked as having the 
same or different meanings, a set of disarranged sentences to 
be interpreted, a set of incomplete numerical series to be com- 
pleted, a set of problems which requires the subject to find 
analogies to certain words, a set of information items in
multiple choice form, and also three other subtests. The intention
always is to set up an orderly and organized pattern of stimuli 
which will reveal the mental characteristics of the person who
makes the responses, and also to show how the responses themselves
must be interpreted if the mental characteristics they are
supposed to indicate are to be correctly appraised. Tests so
conceived are clearly a refinement and standardization of what was
done by Binet’s three teachers. Such is the essential nature of a
psychological test.
There are two distinctions which further clarify the matter. 


1. How does a psychological test differ from a psychological
experiment?


In externals, at any rate, experiments and tests are often 
very similar. Both of them involve the presentation of stimulus 
situations and the appraisal of responses. Moreover, the types of
stimuli used are often pretty much the same. Thus paper and 
pencil mazes have been used both as test items and for experi- 
mental purposes in the psychological laboratory. The difference 
lies in the purpose for which the material is set up. In the maze
test, a series of mazes of increasing difficulty is presented for solu- 
tion by children ranging in age from three to fourteen years. The 
intention is to reveal certain aspects of intelligence, and notably 
“prudent and considered behavior.” Such is the purpose of the 
test (Porteus, 1915, 1924). In the laboratory, however, mazes have 
been used to study the learning process, and to determine what 
is involved in becoming able to run them correctly and confidently. 
In the same way, series of digits for immediate recall have been
used both in testing and in experimental work. But in the first
case the purpose is to reveal the ability of the subject; whereas 
in the second, it is to investigate the process of memory. So, in 
broad terms, the difference is that psychological tests aim to reveal 
the characteristics of persons, and psychological experiments aim 
to reveal the characteristics of mental processes.

The distinction, of course, is far from absolute. Any adequate
appraisal of the mentality and personality of a human being
would obviously call for an understanding of the nature of his
mental processes. If it is proposed to rate him on general
intelligence, introversion, or interest, the question of what these traits
really are is certainly involved; and until it is answered,
thoroughly satisfactory tests cannot be constructed.

Moreover, at the present time these two lines of work, the one
in mental measurement and the other in experimentation, which
have heretofore been pretty separate, are coming together. In
particular, workers in the field of psychological testing are becoming
more and more concerned with the nature of the processes with
which they try to deal. With the accumulation of test data has
come the belief that they ought to throw light not only on the
characteristics of persons, but on the nature of mental processes
and the organization of the mind as well. An elaborate and still
controversial body of techniques known as factor analysis has
been developed with this latter consideration in mind. One
important factorial study has led to the conclusion that performance on
a certain set of tests involves seven mental processes—numerical
facility, word fluency, visualization of space, memory for words
and numbers, perceptual speed, verbal reasoning, and induction
(Thurstone, 1938). Another outstanding investigator, using
these techniques, maintains that most mental test performances 
call chiefly for a general factor which seems to consist of a general 
intellectual energy (Spearman, 1927). We shall return to this 
subject more fully later on. But for the moment the point is that 
though a psychological experiment and a psychological test have 
different orientations, one having to do with persons and the other 
with processes, the two lines of work are already converging and 
are likely to come together more and more as time goes on. 


2. How do psychological tests differ from educational tests? 


In general, there is a broad and obvious working distinction
between tests dealing with mental processes and characteristics 
and those dealing with achievement in school subjects, such as 
reading, arithmetic, spelling, social studies, science, and the like. 
The form is the same in both kinds of tests. Items or stimulus
situations are set up to evoke revealing responses. But the pur- 
pose is different. 

Here again, however, the distinction is far from absolute. Edu- 
cational tests obviously involve mental processes. Achievement in 
a school subject may call, at any rate, for memory, and often for 
understanding, insight, the ability to solve problems or to collate 
data, and for emotional and aesthetic responses as well. On the
other hand, mental tests, and particularly those calling in the main
or exclusively for verbal responses, draw heavily on the material 
learned and the skills engendered in school. Thus it has been 
shown that many standard intelligence tests, such as the well- 
known and widely used National Intelligence Tests, have a great 
deal in common with inclusive batteries of educational achieve- 
ment tests covering a wide range of school subjects (Gates and 
LaSalle). So, too, it is recognized that a well-chosen and diver- 
sified set of educational tests measures psychological processes 
quite satisfactorily. In one very competent investigation, the com- 
bined standing of a group of subjects on a reading test and an 
arithmetic test was taken as an index of intelligence, revealing 
just about what a general intelligence test would show (Lorge,
1945). 

The essential distinction, then, is not between two quite
different kinds of processes. Rather, it is one between mental tests
that are general in scope and educational tests that are specific in 
their reference. Also, it turns upon purpose, the mental test
emphasizing psychological processes and being set up to reveal them in
terms of the score achieved, and the educational test emphasizing 
attainment in some subject or group of subjects. The point has a 
considerable practical importance. Attempts have often been made 
to determine the extent to which pupils in school are exerting 
themselves, or working up to or below their capacities, by
comparing their showings on mental tests on the one hand and on
educational tests on the other. If a child does well on a mental 
test and less well on educational tests, he is supposed to manifest 
a lack of motivation or seriousness, for his achievement is not 
what one might properly expect. This relationship between men- 
tality and attainment has even been reduced to a numerical index, 
called the Accomplishment Quotient, or A.Q., which is the ratio of 
educational age to mental age (Franzen). If a child’s educational 
attainment is not on a level with his mentality, this gives him an 
A.Q. of less than 100 and suggests that something is wrong. And
in some schools a marking system has been set up based, not on 
direct achievement in the various subjects, but on educational or 
achievement ratings divided by ratings on mentality. There is a
good deal to say about this whole idea, but a full discussion will 
have to be postponed. So far, however, this much is clear: it must 
not be supposed that an intelligence rating and an educational 
rating are two independent variables. Mental tests and educa- 
tional tests have much in common and differ chiefly in generality 
and purpose. This in itself suggests very decided doubts about 
such methods of treating their results in general, and about the 
Accomplishment Quotient technique in particular. 
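
The Accomplishment Quotient described above can be illustrated with a short calculation. This is only a sketch: it assumes the conventional scaling in which the ratio of educational age to mental age is multiplied by 100, which is consistent with the remark that a lagging child receives an A.Q. of less than 100. The function name and the sample ages are hypothetical and are not taken from the text.

```python
def accomplishment_quotient(educational_age: float, mental_age: float) -> float:
    """Return the A.Q.: 100 times educational age divided by mental age.

    The factor of 100 is an assumption made for this sketch.
    """
    if mental_age <= 0:
        raise ValueError("mental_age must be positive")
    return 100.0 * educational_age / mental_age


# Hypothetical child: educational age 9 years, mental age 10 years.
aq = accomplishment_quotient(9.0, 10.0)
print(aq)  # below 100, so attainment lags behind mentality on this index
```

Since the two ages rest on tests that have much in common, a quotient computed in this way should not be read as a comparison of two independent measurements.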

Moreover, the distinction between mental and educational tests,
never absolute, is tending to become more and more blurred at 
the present day. Educational tests are being devised that are 
increasingly broad and general in their reference. Such tests are 
directed not only to the memory processes and information, but 
to problematic thinking, the drawing of inferences from data, the 
application of generalizations to specific problems and situations, 
study habits and practices, appreciative insights, and the like. 
Their chief difference from psychological tests proper is that they 
utilize material from one area of subject matter only—apprecia- 
tive insights in the fine arts or literature, the drawing of inferences 
from the data of chemical experiments, problematic thinking in 
mathematics, and so on. But they are apt to stress psychological
processes very heavily. A good recent example of an educational
test which is very general in its character is the battery known as
the United States Armed Forces Institute Tests of General
Educational Development. These instruments deal with natural
science, the social studies, and the humanities. But instead of 
stressing specific content, they call for the interpretation of pas- 
sages of reading material in these fields. As a matter of fact, this 
generality—this tendency to stress psychological processes rather 
than highly specific knowledge and skill—has been made a point 
of criticism against some modern educational tests. It is said that 
they gloss over specific mastery in the area of subject matter con- 
cerned, so that an able person can do well even if he knows little 
about the subject. This is an objection often urged against com- 
pletion tests, which may easily reveal brightness or general intelli- 
gence rather than actual knowledge. 

Finally, there are emerging educational aptitude tests, which 
are borderline cases between the two types here under considera- 
tion. Thus in the Iowa University Placement Examination apti- 
tude tests are set up for entering freshmen in a number of subjects. 
The mathematics aptitude test, for instance, consists of number 
series completion, problems calling for spatial imagination, sym- 
bolic logic, and the interpretation of new and difficult mathe- 
matical reading. It is intended to reveal, not the subject’s present 
mastery of mathematics, but his capacity to attain such a mastery. 
In effect, one might call it an intelligence test with a very strong
mathematical slant. But here again a problem arises. It has been
pointed out above that educational achievement tests heavily
emphasizing the so-called higher mental processes may reveal
mentality rather than mastery of subject matter. On the other hand,
an alleged mathematics aptitude test such as the foregoing may 
be greatly affected by previous mathematical training. A person 
of quite limited mathematical ability and promise who had done 
a good deal of studying in the field might easily outshine a bril- 
liant person who had done little or none. In this case the intention 
of the test—the revelation of aptitude or promise—would be to 
some extent defeated (Stoddard, 1928). 

So in connection with the contrast between psychological and 
educational tests, just as in connection with that between experi- 
ments and tests, it is apparent that lines of work are converging. 
This, of course, raises many problems as yet unsolved, but it is a 
significant tendency in present-day work. The distinction, in any 
case, between psychological and educational tests is pragmatic 
rather than absolute. They differ in purpose and emphasis. They
differ in generality. If what we want is a rating on mentality, then 
clearly a psychological test is the one to choose, because it is set 
up with this intention and because a person’s performance on it 
can be immediately interpreted in terms of his mental charac- 
teristics. But it is not true that such a test turns upon factors 
wholly or even very largely distinct from those that affect
performance on an educational test.


TYPES AND CLASSES OF PSYCHOLOGICAL TESTS


Existing psychological tests can be classified in a number of
different ways, some of which are quite superficial, whereas others
reveal deep and far-reaching differences.


1. Psychometric and projective tests


The most fundamental classification of existing tests is into
psychometric and projective instruments. There is between them
a far-reaching distinction both in purpose and in methodology.
The purpose of a psychometric test, as the term itself indicates,
is to reveal or measure the amount of some mental trait or
characteristic possessed by the subject. The purpose of a projective test,
on the other hand, is to reveal the quality or type of the subject’s
personality. Thus it is not, properly speaking, an instrument of
measurement at all. “A projective method for the study of
personality involves the presentation of a stimulus situation designed
or chosen so that it will mean to the subject not what the
experimenter has arbitrarily decided that it shall mean . . . but rather
whatever it must mean to the person who gives it, or imposes upon
it his private idiosyncratic meaning and organization” (Sargeant,
p. 257).


With regard to methodology, the psychometric test sets up
stimulus situations to which definite predetermined values have
been assigned. Thus the series of numbers 2, 4, 6, 8 is presented,
and the task is to indicate what the next number ought to be. If
the response “10” is forthcoming, the subject receives a certain 
designated score. Or a vocabulary list of words in order of increas- 
ing difficulty is given orally, the task being to assign proper mean- 
ings to the various words. If a subject can define twenty of them, 
he comes up to the mental age level of eight. If he can define 
thirty, he comes up to the mental age level of ten. Or lists of 
questions are set up, dealing with personal feelings and reactions
—whether one is shy, or daydreams a great deal, or dislikes finding 
his way in strange places. Affirmative or negative replies are 
assigned scores purporting to show the extent of introversion, or 
dominance-submission, or the like. This illustrates what is meant 
by saying that a psychometric test is an instrument of
measurement. A projective test, on the other hand, may present as stimulus
situations a set of ink blots or a set of pictures, the task being to
give a free interpretation of them in answer to such an inquiry as 
“What could that be?” The subject is at liberty to say anything 
he likes, and is in fact encouraged to do so, pains being taken to
avoid influencing him in any way. Obviously his responses do not 
result in measurement, but it is believed that they can be inter- 
preted as revealing the characteristics of his personality, his emo- 
tional trends and blockages, and type of disposition. 
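
The psychometric scoring described above, in which predetermined values are attached to responses, can be sketched in a few lines. Only the number series 2, 4, 6, 8 with the answer 10 and the two vocabulary thresholds (twenty words for mental age level eight, thirty for level ten) come from the text. The function names, the credit of one point, and the treatment of counts below twenty are assumptions made purely for illustration.

```python
# Vocabulary thresholds from the text: (words defined, mental age level reached).
# Listed from highest to lowest so the first match is the highest level reached.
VOCABULARY_LEVELS = [(30, 10), (20, 8)]


def vocabulary_mental_age(words_defined: int):
    """Return the highest listed mental age level reached, or None if none is reached."""
    for threshold, level in VOCABULARY_LEVELS:
        if words_defined >= threshold:
            return level
    return None  # assumption: the text gives no level for fewer than twenty words


def score_series_item(response: int, keyed_answer: int = 10, credit: int = 1) -> int:
    """Award the designated credit when the response matches the keyed answer.

    The default key of 10 corresponds to the series 2, 4, 6, 8; the credit of
    one point is a hypothetical value.
    """
    return credit if response == keyed_answer else 0


print(vocabulary_mental_age(24))
print(score_series_item(10))
```

No comparable key can be written for a projective test, since its responses are interpreted rather than matched against predetermined values.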

Projective instruments of one kind and another have been in 
use for a considerable time, particularly in clinical psychology 
and psychiatric work. Nor are projective elements by any means 
wholly absent from psychometric tests. The manner in which a
child responds to an individual intelligence test is always con- 
sidered an important indication in addition to the actual content 
of his response. But it is only rather recently that projective
testing has received widespread attention. The literature on
projective testing is already very large, and it has developed into a
special, extensive, and rapidly expanding field which cannot be 
treated adequately in anything less than a volume devoted spe- 
cially to it. But although it lies beyond the scope of a book which, 
like the present, deals primarily with psychometric testing, it is
important here to clarify the distinction between projective and 
psychometric instruments. They belong to different categories. 
Misunderstanding of this fact can only lead to drastic misinterpre- 
tations. 


2. Classification by types of process 


A less fundamental but still important and serviceable basis for 
the classification of mental tests is in terms of the kind of mental 
process they purport to reveal. Here no single scheme is
universally accepted and agreed upon in all details. There are tests of
general intelligence. There are tests of special aptitude, manual, 
mechanical, vocational, and the like. There are tests of special 
talent, such as musical or artistic ability. There are tests of
interest. There are tests of personality, dealing with such
characteristics as introversion, sociability, and the like. There are tests
of attitude towards specific school subjects or specific races, or 
towards any school subject or any race, and so on. There are tests 
which deal with determining values, such as aesthetic or practical 
reactions to life and its problems. There are tests of moral traits, 
such as honesty and fair-mindedness. 

Just what classification in this respect is adopted is not of the 
first importance. The designated mental processes are not well 
defined. A talent is not clearly differentiated from an aptitude. 
Attitudes, values, and moral traits merge into one another. The 
meaning of the various terms is far from clear. All that is needed 
is a grouping of existing tests that offers some convenient frame 
of reference so that it is possible to know fairly well to what a 
person is referring. 


3. Classification by types of items 


First there are tests that are verbal in content. The stimulus 
consists of words, including of course mathematical symbols. The 
task is to manipulate words or symbols. The required response is
verbal. Then there are tests that are prevailingly nonverbal, which
present pictures to be interpreted or matched, blocks with which
to build indicated designs, form boards with holes of various 
shapes into which the corresponding cutout pieces are to be fitted, 
boards with numerous holes into which pegs are to be placed, and 
so forth. Such tests are not entirely nonverbal, because the
instructions at least are given orally. All of them are sometimes
lumped together as performance tests, i.e., tests which call for 
manipulation rather than verbal response. This, however, is not 
correct usage and leads to serious confusion. Test items of the
kind described can be used to reveal some general mental trait, 
usually intelligence. In this case we have a true performance test. 
Or such items can be used to reveal manual or mechanical ability, 
and in this case the test should be classified as an aptitude test. 
Then once again there are true nonlanguage tests, in which the 
task is to manipulate or compare or arrange objects or follow 
directions, and the instructions are given in some sort of
pantomime to avoid the use of speech.


4. Classification by mode of administration


In this connection two types of tests are found. First there are
individual tests, so called because they are administered to only
one subject at a time. Such tests have the decided advantage of 
making possible oral as well as written or manipulative responses. 
Outstanding instances are the various revisions of the Binet scale, 
and the Kuhlmann Tests of Mental Development. Then there are 
group tests, so called because they can be given to groups of sub- 
jects simultaneously. One sometimes finds the term scale used as 
though it meant the same thing as individual test. This, however, 
is quite erroneous. Not all individual tests would be thought of 
as scales. And not all scales are instruments for purely individual 
administration. 

This discussion of classification will serve to explain certain 
terms commonly used in the field of mental testing, and to present 
some idea of the scope of modern mental testing. It is evident 
that all groupings except the first are chiefly pragmatic and do not 
turn on any very clear-cut or fundamental differences. 


LIMITATIONS AND VALUES OF MENTAL TESTS


During the past forty years vast experience in the application 
of mental tests has accumulated, and almost innumerable research 
investigations regarding them have been conducted. This provides 
a broad and firm basis for summarizing with a good deal of cer- 
tainty both their limitations and their possibilities. 


1. Types of test items 


A good way to approach this question is by a survey of repre- 
sentative types of items that have been and are being used in test 
construction. As Binet himself showed many years ago in the 
piece of informal research described at the beginning of this 
chapter, the very feasibility of testing turns upon the devising and 
organizing of stimulus situations which are significant and revealing.
The demonstration that this could be done, that specific items
could be assembled to indicate general mental characteristics,
provided the impetus that started the modern testing movement
on its course. To make anything like a complete catalog of such
items would certainly be a very large and difficult undertaking.
Fortunately it is not necessary. All that is needed is a survey
sufficient to make their general nature clear.

Although enormous numbers of tests have been published and 
used, there is a strong family resemblance among the items out 
of which they are built. Since the early work of Binet, whose first
tests were published in 1905, a vast pool of such items has been 
accumulated. Successive test makers have drawn upon this pool 
again and again, often adding minor variations, sometimes pro- 
posing definitely new departures and novelties. Great ingenuity 
has gone into this work. But the general outcome has been in the 
direction of uniformity rather than of wide variety. 

Among the items used in verbal intelligence tests, the following 
are some of the most familiar. 

Verbal opposites, such as: black—blue, light, white, dark. (Un- 
derline the word among the last four which is opposite in meaning 
to the first.) 

Analogies, such as: shoe—foot; glove—head, wrist, leg, hand.
(Indicate the word among the last four which has the same relation
to the third as the second has to the first.)

Best reasons: Several alleged reasons are listed for a course of
action, a belief, etc., the task being to indicate the best one. 

Disarranged sentences, such as: turn gas go if the I stove off 
out the will. (Interpret correctly.) 

Sentence completion: Sentences with a word or words omitted, 
to be filled in so as to make sense. 

Proverbs: Familiar proverbs presented for interpretation, which 
may be given orally as in an individual test, or chosen from a 
list of possibilities as in a group test. 

Series completion: An arithmetical or algebraic series is presented,
the task being to indicate what the next logical term should
be. Sometimes the subject must decide on the proper continuation
directly, and sometimes he must make a choice from several listed
possibilities.

Directions: Instructions to be followed, sometimes in large overt
action, such as doing a number of things in the room in a designated
order, as in an individual test; sometimes with paper and
pencil, such as following a chart or a maze course in a designated
manner.

Information: Items in “objective” form, usually multiple choice,
to indicate the extent of the subject's information about widely
varied things.

Memory items have been rather widely used. In individual
tests they take the form of digits or sentences with a stated
number of syllables that the subject is required to repeat verbatim,
or paragraphs or episodes of which the subject must give
the gist after hearing them read once. Another type of memory
item is to see how many of the objects in a collection the subject 
can name after looking at them for a given time. Items of this type 
appear with appropriate modifications in a good many group tests. 

Arithmetic problems: The use of such problems, given either 
in verbal or numerical form, is a favorite resource in test con- 
struction. 

Vocabulary: Vocabulary and word knowledge items are con- 
sidered to be of great value. A common procedure is to set up a 
list of graded stimulus words and to see how many of them the 
subject can roughly define. 

Classification: A stimulus word followed by several alternative 
suggestions as to the classification of the thing indicated, the task 
being to pick out the right one. 

As instances of the type of primarily nonverbal items, widely 
used in intelligence tests, the following three are typical. 

Draw a man: The subject draws a schematic picture of a man 
on instructions to do so, and is rated on his showing the proper
number of limbs, etc., and on indicating perspective and propor- 
tion. This is a novel type of item and has been used in only a very 
few tests. 

Space relationships: See Figure 1 A. The task is to determine
what numbers are only in the circle and what number is in all
three figures.

Cube relationships: See Figure 1 B. The task is to decide how
many cubes there are in such piles. 

Aptitude, personality, attitude, and talent tests naturally use a 
much greater variety of types of items. Here are a few typical 
instances. 

Number checking: Series of pairs of numbers, some the same 
and some different, the task being to indicate which are which. 
Intended as one of the measures of clerical aptitude. 

Mechanical relationships: Tracing the dynamic relationships in 
a complex machinery layout, such as a system of interlocking
gears and pulleys shown in pictorial form. 

Mechanical assembly: Assembling various objects from their
disconnected parts—bells, paper clips, spark plugs, etc. The time 
the subject takes to assemble each object usually affects his score 
considerably.


Aesthetic comparison: Comparisons of grouped pictures, visual
designs, literary excerpts, musical themes, etc., the task being to
indicate their relative aesthetic values. The test objects are sometimes
presented in pairs, sometimes in larger groupings to be
ranked. 

Personal questions: A very large number of types of personal
questions are used in tests of personality, attitude, value, and
interest. They have to do with the subject’s opinions about himself,
his likes and dislikes, his views on various subjects, his interests,
and so forth. They may be thrown into a form that enables
him to answer yes or no, or set up as choices between various
stated alternatives, or he may be asked to give numerical indications
of the strength of his feelings, beliefs, and the like.

Projective tests usually present specific stimuli but give the
subject a great deal of latitude in making any response he desires.

This, of course, is very far from a complete survey of all the
types of test items in use. There are literally endless variations and
combinations that have been tried out.* But so far as psychometric
tests are concerned, they all have a common characteristic.
All of them are designed and set up in such a manner that the
responses they elicit can be treated numerically; that is, they can
be counted so that a total score is possible. This procedure lies at
the very heart of psychological testing as we know it today, and
alone makes such testing possible.

* See Pintner, 1931, pp. 183-90, for a fuller account of basic types of items
used in intelligence testing.
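
The counting of keyed responses into a total score can be illustrated with a minimal sketch. The answer key, the item numbers, and the subject's responses below are all invented for illustration and do not come from any published test; the only point is that each item is reduced to "credited" or "not credited" and the credits are summed.

```python
# Hypothetical answer key for a five-item objective test
# (item number -> keyed response). Key and responses are invented.
ANSWER_KEY = {1: "white", 2: "hand", 3: "b", 4: "c", 5: "a"}


def total_score(responses, key=ANSWER_KEY):
    """Count the items whose response matches the keyed answer.

    Omitted items, and responses that differ from the key, earn no credit.
    """
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# One subject's responses; item 4 was omitted and item 2 missed.
subject = {1: "white", 2: "wrist", 3: "b", 5: "a"}
print(total_score(subject))
```

A scheme of this kind gives credit only for the keyed response, so an unexpected answer counts for nothing however good it may be.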


2. Limitations of psychological tests 


It must, however, be abundantly clear that tests built out of 
such items are subject to very grave limitations. In order to gain
a correct understanding of what such tests can and cannot do, it 
is very necessary to appreciate the nature of these limitations. 

A. They cannot directly reveal a person’s capacity for complex 
and sustained learnings. Such capacity is of the highest impor- 
tance, and it is beyond doubt one of the characteristics of all 
superior achievement. To learn a language well and rapidly, to 
master a science or a branch of mathematics, to become able to 
deal with complex practical situations such as confront the businessman,
or the strategist, or the surgeon, or the engineer call for
sustained energy, persistence, and very complex processes of men- 
tal organization. Yet our tests cannot deal with such processes, for 
the simple reason that they cannot be translated into items which 
are numerically manageable and susceptible of being counted. 
Perhaps, as it has been claimed, these processes always exist in a 
person’s behavior in some definite amount. But if there is no way 
of determining what that amount may be, the hypothesis is of 
little aid.

B. Our tests cannot directly reveal capacity for disentangling 
concepts from complex masses of data. Here again are mental 
processes of the highest importance and significance. They are 
central in much of what is best in human achievement, such as 
the slow and painstaking discovery of scientific laws and prin- 
ciples, the working out of proper operative techniques, decisions 
upon the right line of action in the midst of many and bewildering 
alternatives, and “straight thinking” generally. Yet once more they 
cannot be reduced to significant quantitative items. 

C. Our tests cannot directly reveal capacity for consistent and 
considered choice between possible courses of action. This capac- 
ity clearly involves such traits as persistence, judiciousness, and 
self-confidence, which are partly intellectual and partly ethical. 
The very fact that tests as we know them must, because of their 
nature, isolate intellectual and moral factors and set them up in
limited and artificial situations means inevitably that they cannot
deal with some of the most fundamental aspects of life and 
behavior. 

D. Our tests cannot directly reveal capacity for dealing sensibly 
and wisely with practical problems. A boy may be asked what he 
would do if he found an automobile on the street abandoned and 
unlocked with the keys in it. He may be scored on his response 
to three or four possible alternatives, but his answer may have 
very little relationship to his actual behavior on such an occasion, 
because the essential elements of temptation and opportunity are 
lacking. A man may be asked to assemble ten small objects or to 
trace a complex of mechanical relationships in a diagram, and 
he may be scored quite definitely on the result. But it would be 
very rash to ask him to adjust the carburetor of one’s car, or to 
repair a radio circuit simply on the basis of his showing. 

E. Our tests cannot reveal directly a person's capacity for con- 
trolled and effective methods of work. Good methods of work are 
one of the most decisive of distinguishing marks in a person of
high achievement and ability. They are usually learned slowly 
and painfully over a period of many years and as a result of much 
experience and many contacts and suggestions. Moreover, they 
are highly individualized and must be suited to the person con- 
cerned, so that beyond a certain point they cannot be standardized. 
There is no way whatever to reduce such processes to a series of 
test items capable of adding up to a total numerical score. 

F. Our tests cannot directly reveal the depth, strength, and 
subtlety of a person’s appreciative reactions in ethical, social, or 
aesthetic matters. Such measures as we have looking in these direc- 
tions usually consist either of short questions for self-evaluation 
or of objects such as pictures or poems to be compared in terms
of aesthetic value. It is, however, doubtful whether such value- 
discriminations can be put fully in verbal form; and if they can, 
the expression of them would be very complex and full of quali- 
fications. As to direct comparisons between the aesthetic status of 
pairs of objects, it is quite clear that the sensitive person will be
aware of and respond to all sorts of nuances which cannot show
up in such choices.

G. Above all, our tests cannot even begin directly to reveal
capacity for producing original ideas and constructions, for initiative,
for the original solution of problems, for creative endeavor.
Indeed, the type of items used systematically discourages originality
and places every emphasis upon the production of the expected
“correct” response. Thus it has been pointed out that if a child 
produced a brilliant but novel definition of a word given in a 
vocabulary test, thus indicating a very high level of mentality, he 
would nevertheless be scored zero on the item because his reply 
did not fit in with the standardized scheme of the instrument. 

In general, tests built out of items such as have grown into
general use cannot directly reveal the “higher mental processes”— 
the very processes most typical of human behavior at its best, on 
which the loftiest distinction and supreme achievement depend. 
By way of answer it may, to be sure, be said that our present psy- 
chometric instruments do in fact indirectly reveal the presence 
and to some degree the excellence of these processes and capacities.
On this premise, if men like Macaulay, or John Stuart Mill, or 
Darwin, or Leonardo da Vinci could be persuaded to submit to a 
well-chosen battery of tests, they would rank extraordinarily high, 
even though they certainly would not find an opportunity to show 
their power in all its fullness. There is much real force in this 
argument, and it is probably quite sound as far as it goes. Yet it 
does not wholly meet the case; for the fact seems clear that


and a concentration on the artificial and trivial, it cannot be denied 
that their claims have enough validity to warrant serious attention.


3. Values of psychological tests 


The other side of the picture is that despite limitations which 
every judicious student of the subject is bound to recognize, the 
modern testing movement has achieved great and indubitable 
successes, both practical and theoretical.

A. It is in connection with the practical uses of psychological
tests that the most obvious and unanswerable case can be made.
Always granted a proper choice of instruments and a proper interpretation
of the results they yield, many eventualities can be
foreseen, and many costly errors in dealing with human beings
eliminated.

Let us, for instance, consider the data presented in Tables 1
and 2 which, be it noted, were reported a good many years ago. 
Table 1 shows a close relationship between performance on the 
Stanford Revision of the Binet scale* and educational pros- 
pects. All entering high school freshmen above a certain level on 
the intelligence test complete their course, and nearly all of them 
are in higher institutions five years after testing. None of those 
below a certain level are in either category. Moreover, in the
classification the relationship becomes still more evident. The 
top quarter of this same group of 107 students had intelligence 
quotients running from 119 to 142. Of them 100% finished high 
school, and 91% continued education beyond it. The lowest quarter
of the group had intelligence quotients running from 79 to 97.
Of them 37% finished high school, and 12% continued their 
education beyond it (Proctor, 1925). 


TABLE 1

STANFORD-BINET I.Q.’S OF ENTERING HIGH SCHOOL PUPILS IN RELATION
TO EDUCATIONAL PROSPECTS

(After W. M. Proctor, 1925, p. 30 ff.)

                        STANFORD-BINET I.Q.’S OF 107 ENTERING HIGH SCHOOL PUPILS

                        125 plus | 115-124 | 105-114 | 95-104 | 85-94 | 75-84
Percents completing
high school                100   |   96    |   83    |   75   |  40   |   0
Percents in higher
institutions five
years after testing         95   |   86    |   54    |   28   |  18   |   0

* The Stanford Revision of the Binet scale was the first revision of the Binet
tests made at Stanford University under the auspices of L. M. Terman in 1916.
The Revised Stanford-Binet scale was the second revision, made also at Stanford
under the same auspices in 1937.

Consider Table 2. Note that of the three individuals
in the bracket 75-84 I.Q., no one gets an average mark above
C+; whereas of the four in the bracket of 135 and over, no one
gets an average mark of less than C+. Moreover, if the facts in
regard to an individual's intelligence level are known, much can 
be done to help and guide him. Twenty-two students entering 
high school were given the Stanford Revision of the Binet scale in 
the second half of their eighth-grade year and, on the showing 
they made, were given special guidance. Their later work was 
compared to that of 109 unguided high school students comparable 
in intelligence. Of the guided group 18% made one failure in a 
high school course, whereas 31% of the unguided group made one 
failure. None of the guided group made two or more failures, 
whereas 11% of the unguided group did so (Proctor, 1918). These 
investigations were made more than two decades ago, and they 
have been selected for this very reason. Their findings are typical 
and have been confirmed in substance and amplified time and 
again since then, in many and varied connections. Clearly, then,
a mental test score reveals something very well worth knowing,
even though it is not always decisive and is hedged about by
many admitted limitations.


TABLE 2

STANFORD-BINET I.Q.’S OF 131 HIGH SCHOOL STUDENTS IN RELATION TO
AVERAGES OF ALL HIGH SCHOOL MARKS

(Quoted from W. M. Proctor, 1925, Table 5, p. 47)

AVERAGE OF
ALL HIGH                        STANFORD-BINET I.Q.’S
SCHOOL
MARKS    | 75-84 | 85-94 | 95-104 | 105-114 | 115-124 | 125-134 | 135 up | Totals
A        |   0   |   0   |    0   |    3    |    4    |    4    |    1   |   12
B+       |   0   |   0   |    2   |    5    |    8    |    4    |    1   |   20
B        |   0   |   7   |   19   |    8    |   12    |    7    |    1   |   54
C+       |   1   |   5   |    5   |    4    |    2    |    0    |    1   |   18
C        |   1   |   4   |    6   |    3    |    1    |    0    |    0   |   15
D        |   1   |   5   |    3   |    1    |    0    |    0    |    0   |   10
E        |   0   |   1   |    1   |    0    |    0    |    0    |    0   |    2
Totals   |   3   |  22   |   36   |   24    |   27    |   15    |    4   |  131
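
The trend in Table 2 can be made more explicit by converting the counts in each I.Q. bracket into the percentage of pupils whose average mark was B or better. The sketch below is only an illustration: the cell counts are as read from Table 2, and the few cells that are illegible in print have been inferred from the row and column totals, which is an assumption. The function names are hypothetical.

```python
# Cell counts as read from Table 2 (rows: average mark; columns: I.Q. bracket).
# Cells that are illegible in the printed table are inferred from the row and
# column totals; treat those values as assumptions.
BRACKETS = ["75-84", "85-94", "95-104", "105-114", "115-124", "125-134", "135 up"]
COUNTS = {
    "A":  [0, 0, 0, 3, 4, 4, 1],
    "B+": [0, 0, 2, 5, 8, 4, 1],
    "B":  [0, 7, 19, 8, 12, 7, 1],
    "C+": [1, 5, 5, 4, 2, 0, 1],
    "C":  [1, 4, 6, 3, 1, 0, 0],
    "D":  [1, 5, 3, 1, 0, 0, 0],
    "E":  [0, 1, 1, 0, 0, 0, 0],
}


def column_totals(counts):
    """Number of pupils in each I.Q. bracket."""
    return [sum(col) for col in zip(*counts.values())]


def percent_with_marks(counts, marks):
    """Percentage of each bracket whose average mark is one of `marks`."""
    totals = column_totals(counts)
    hits = [sum(col) for col in zip(*(counts[m] for m in marks))]
    return [round(100 * h / t, 1) for h, t in zip(hits, totals)]


totals = column_totals(COUNTS)
b_or_better = percent_with_marks(COUNTS, ["A", "B+", "B"])
for bracket, n, pct in zip(BRACKETS, totals, b_or_better):
    print(f"{bracket:>8}: n = {n:3d}, B or better = {pct:5.1f}%")
```

The two extreme brackets contain only three and four pupils respectively, so their percentages should be read with caution.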

As another example may be cited the large body of work that 
has been done on the use of psychological tests in connection with
college entrance. The first necessity is the selection of a suitable 
test, sufficiently difficult to reveal differences in mental ability 
among the highly selected individuals who present themselves for 
rating. An instance of such a test is the Thorndike Psychological 
Examination for High School Graduates. Experience and investi- 
gation have decisively shown that such an instrument, which takes 
about three hours to administer, will discriminate well between 
those likely to succeed in their college course and those apt to have 
difficulty or to fail (Thorndike, 1920). It will, in fact, predict 
success much better and more reliably than the pattern of subjects
taken for college preparation, and on the whole just about as well
as the student’s average mark in all his subjects during his high
school career. And the test in this case, as pointed out, requires 
but three hours, whereas a high school course takes four years! 
Here is another finding, reported years ago, and consistently con- 
firmed. 

Moreover, when one such test is given year by year for a period 
of years to all students applying for admission, a college can 
determine a critical score below which success is unlikely (Wood).
Such a score must be determined by the individual college, because 
standards vary so greatly among different institutions as to make 
the establishment of a general level applicable everywhere impos- 
sible. The college, however, may admit candidates who fall below 
the critical score if other factors are unusually favorable—if they 
are serious, hard-working, and have a superior character record. 
Thus psychological tests do not tell the whole story by any means. 
But there can be no doubt that they furnish highly important and 
valuable data which have the great advantage of being more or 
less an independent estimate of the persons concerned. 
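
One way a college might arrive at such a critical score from its own accumulated records is sketched below. This is a minimal illustration and not the procedure of any study cited here: the records are invented, and the rule adopted (take the highest cutoff for which past entrants scoring below it succeeded at a rate no greater than a chosen maximum) is an assumption made for the sake of the example.

```python
def critical_score(records, max_success_rate=0.25):
    """Return the highest cutoff such that past entrants scoring below it
    succeeded at a rate no greater than `max_success_rate`.

    `records` is a list of (test_score, succeeded) pairs.
    Returns None if no cutoff qualifies.
    """
    best = None
    for cutoff in sorted({score for score, _ in records}):
        below = [ok for score, ok in records if score < cutoff]
        if below and sum(below) / len(below) <= max_success_rate:
            best = cutoff
    return best


# Invented records for one hypothetical college: (test score, completed course?)
records = [
    (40, False), (45, False), (50, False), (55, True), (60, False),
    (65, True), (70, True), (75, True), (80, True), (90, True),
]
print(critical_score(records))
```

Because the result depends entirely on the institution's own records and on the success rate it is willing to tolerate, two colleges applying the same rule would ordinarily reach different scores.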

To take a more recent example, a specially designed test, the 
Medical Aptitude Test, was for many years prepared and used 
annually in connection with the admission of candidates in many 
leading medical schools under the auspices of the Association of
American Medical Colleges. Reports and data were exchanged
between the Association and the various institutions using the 
test, so that it was systematically improved, and better and better 
interpretations of the showings it revealed were built up. Its gen- 
eral value is clear from the data summarized in Table 3. As will 
be seen, the scores when divided into decile steps had a very
definite relationship to average grade in medical school and to the
probability of failure. In view of the expense of medical education,
such prediction is obviously important (Kandel, Moss). The
fact that within the last year or two the individual schools have
taken over responsibility for testing admission candidates, so that
the nationwide program has been discontinued, cannot impair these
findings or the general significance of the instrument itself.
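
The regularity of the relationship reported in Table 3 can be checked from the average grades alone. The sketch below takes the ten average grades as printed in Table 3, numbers the deciles 1 to 10 in the order the table lists them (an assumption about the table's layout), and computes an ordinary product-moment correlation. The function name is hypothetical.

```python
from math import sqrt

# Decile steps and average medical school grades as listed in Table 3.
deciles = list(range(1, 11))
avg_grade = [85.0, 82.4, 81.9, 81.1, 80.8, 80.3, 79.8, 78.5, 77.8, 76.4]


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(deciles, avg_grade)
falls_at_every_step = all(a > b for a, b in zip(avg_grade, avg_grade[1:]))
print(f"r = {r:.2f}; grade falls at every decile step: {falls_at_every_step}")
```

A correlation computed over ten group averages describes the trend of the groups and is much higher than the correlation that would be found between individual test scores and individual grades.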

TABLE 3

RELATIONSHIP BETWEEN TEST SCORES ON MEDICAL APTITUDE TEST AND
ACHIEVEMENT IN MEDICAL SCHOOL

(Kandel, p. 17)

Decile Test Score | Percent of Failures | Average Grade
        1         |         [?]         |     85.0
        2         |          8          |     82.4
        3         |          8          |     81.9
        4         |         [?]         |     81.1
        5         |         10          |     80.8
        6         |         12          |     80.3
        7         |         14          |     79.8
        8         |         18          |     78.5
        9         |         19          |     77.8
       10         |         25          |     76.4

Finally, reference must be made to the very extensive use of
psychometric tests of many kinds during World War II. The
varied testing programs in the various branches of the armed
forces well exemplify both the values and problems of psychological
measurement, alike in their development, their procedures,
their numerous successes, and their occasional failures (v. Davis, 
1943; Guilford, 1943; Stalnaker, 1945). To cite the selection of 
aviation cadets as a specific instance, during the period from 1924 
to 1941 anywhere from 45% to 75% of trainees were rejected at 
some stage in their training, usually at an early one. A sequential 
testing program, designed to reveal first general fitness, and then 
special qualification for specific duties, was an important factor in 
reducing this high proportion of waste very markedly (Flanagan,
1942).

The guidance counselor, the clinical psychologist, and the psychiatrist,
too, find numerous and important uses for psychological
tests. Such workers employ tests not so much for exact and final 
measurement, but for the refinement of observation. No person 
experienced in work of this kind will make himself the slave of 
test data. But he will utilize tests for the observation and assess- 
ment of behavior under controlled conditions (Cornell and Coxe, 
Rapaport, Gill, and Schaefer). 

So far, then, there is a clear case. Experience and investigation 
have consistently shown that good tests, properly used and conser- 
vatively interpreted, are exceedingly valuable instruments. What- 
ever their limitations, they do in fact provide important informa- 
tion about the capabilities and prospects—vocational, academic, 
and personal—of human beings. Moreover, they do so quickly. If 
it is possible to obtain in the space of anywhere from 30 to 180 
minutes an estimate of a person or a group of persons, which will 
at least roughly correspond to reality, it is hard to deny that the 
devices for so doing have justified their existence. 

What the clinician, or educator, or personnel worker, or voca- 
tional counselor does about the test data when obtained is, of 
course, another story. The evidence yielded by such material is 
only partial. In this respect psychological tests resemble the laboratory
tests of the physician. He would not wish to be without
them; but once they have been applied, he proceeds in terms of his
knowledge of the total picture and of his general outlook and 
experience. But that the data are worth having, and within limits 
highly significant, nobody can deny. 

B. Few critics of mental testing have sought to combat such 
claims as these, for their authenticity is altogether too patent. But 
not a few have argued that tests are limited to purely practical 
and ad hoc values. Thus Thomas (q.v.) insists that they are built
on wholly unsound premises from which they should be freed and
that they should be considered simply as “engineering instruments
for purposes of evaluation” (p. 83).

It certainly seems peculiar at the very least to recommend instru- 
ments as useful, practical tools while saying at the same time that 
they are theoretically quite unsound. Indeed, the suggestion might 
very well be that there must be something unsound about the
theoretical position of the critic. But the point may be laid aside 
for the moment. 

It is perfectly true, as there will be many occasions to observe, 
that many common assumptions in connection with tests are at
least questionable and often untenable. Many confident assertions
have been made about heredity, environmental influences, the 
process of mental growth, and the distribution of mental abilities 
with the claim that they are logically based on test results and 
so have a valid scientific foundation. Yet quite often such asser- 
tions are indefensible. Psychological testing is relatively new. It 
has been remarkably successful. It has about it an aura of scien- 
tific precision that has led to many rash, brash, and hasty con- 
clusions that go far beyond the established facts. But to dismiss 
psychological testing as nothing more than a pragmatic bag of
tricks is a very great mistake. It certainly does not reveal with 
any finality or completeness the nature, organization, and action 
of the human mind. Indeed, it throws considerably less light on 
these ultimate questions than a good many people were once 
inclined to suppose. But it does provide a methodology and a 
growing accumulation of data that the theoretical psychologist 
cannot legitimately ignore. 

Many years of patient research lie ahead, and many lines of 
work must be pushed onward until they converge before the true 
general significance of what tests are revealing becomes apparent. 
But new techniques of analysis are emerging, one conspicuous 
instance being factor analysis. And to deny that tests have any 
theoretical significance or basis because their orientation has been 
in the first instance practical and because they have not yet
uncovered the central mystery of psychology is both untenable
and unintelligent.


MENTAL TESTS AND PSYCHOLOGICAL THEORY


Critics of mental testing, centering on the limitations which 
have been considered above, have argued that they are due to an 
unsound psychological orientation. The contention is as follows.

Existing psychometric instruments are said to be based upon 
the presuppositions of an atomistic or mechanistic psychology. Of
necessity they undertake to isolate and measure separate abilities, 
such as general intelligence, interest, mechanical aptitude, socia- 
bility, musical talent, and the like. There seems no other pro- 
cedure, so far as can be seen at the present time. These abilities 
are thought of as independent unitary functions, and the indi- 
vidual human mind is, at least by implication, regarded as the sum 
total of these units which exist in it in ascertainable amounts. 
Also, these separate abilities or functions are thought of as operating
in test situations as they do elsewhere. Thus the higher levels
of mental organization and action, such as originality, initiative, 
creative power, capacity for continuing and persistent learning, 
and so forth, are composites of simpler and measurable elements. 
The higher mental processes are admittedly important, and admit- 
tedly they cannot be tested directly; but their components can 
be measured directly by existing means, at least to a considerable 
extent. In effect, the position is that we cannot directly measure 
the unique characteristics which make up the mind of an Einstein 
or a Raphael; but we can get a set of figures which, within the
technical limitations of our instruments, will authentically repre- 
sent the powers and functioning of such a mind, because its com- 
ponent abilities can be tested. Such, it is said, is the psychological 
theory on which psychometric procedure depends. 

This whole viewpoint, however, it is argued, is erroneous. It is 
diametrically opposed to the organismic or holistic or configura- 
tionalist psychology coming more and more into prominence. The 
individual mind is precisely not a composite of unitary traits or 
abilities, but a functioning unit. Intelligence, for instance, cannot 
be separated from interest. What is called musical talent, or artis- 
tic talent, or mechanical aptitude is not a sort of special faculty, 
but is essentially the mind or personality as a whole operating in 
a particular way. Moreover, it is false to claim that a person will
behave in the same manner in a test situation as he does elsewhere. 
For instance, a fairly typical question which occurs in a certain 
test of “practical judgment” is: What is the right thing to do if 
you find you are going to be late for school? The answer that a
child makes on the test blank may have little relationship to what 
he would actually do in such a situation if it really happened, for 
the personality as a whole responds differently in different situa- 
tions. 

This entire line of argument is found expressed with different 
degrees of completeness in the work of many writers, but it has 
been brought together with particular effectiveness by Thomas 
(q.v.). The general conclusion to be drawn is that if the criticism
holds good, then projective instruments which call for free personalized
reactions to the stimuli presented might be well founded.
But psychometric instruments would be founded on sand. Also,
Thomas points out that the configurationalist or holistic psychology,
notably as expressed by John Dewey, has been the basis of
the progressive movement in education. So he holds that there is a
fundamental inconsistency between psychometric testing and the 
whole body of doctrine and practice that goes by the name of 
Progressive Education. 

Up to a point the argument is convincing. Certain workers in 
the field of mental testing have at any rate seemed to express 
themselves in the language of a so-called atomistic psychology. 
They may be right or they may be wrong, but in so far as they 
have put forward a general viewpoint, it is perfectly legitimate 
to criticize them. The point is that such a viewpoint is not, as a 
matter of logical necessity, the true basis of psychometric testing. 

The ultimate consideration is that our best tests—and they are 
numerous—really do work. They have validated themselves in 
actual practice, not perfectly to be sure, but quite well enough to 
be of great service. They do not tell the whole story, and those 
who make and use them should always bear this in mind. But
they do tell an important part of it. Good intelligence tests really 
do indicate capacity for academic education, though they are less 
certain indicators of business and professional success. Medical 
aptitude tests, which are essentially intelligence tests with a medi- 
cal bias, really do foretell success in medical school and during 
internship, though their relationship to a man’s success as a practicing
physician is another matter. Measures of interest are closely
related to effectiveness and satisfaction in many vocations. The 
best measures of personality and temperament have real clinical 
value, differentiate well between stable and unstable individuals, 
and are of definite service in the diagnosis of mental aberrations. 
All this surely indicates that such instruments must be to some 
extent at least psychologically sound. To argue, as Thomas and 
many critics actually do, that they are theoretically nonsensical 
yet practically useful is a violent anomaly. 

The truth of the matter is this. The attack is made against 
views which may be premature, or extreme, or incautious, but 
which are essentially irrelevant, and not the necessary basic psy- 
chological theory underlying our psychometric testing. Any test 
must be built about some concept of the thing to be tested. Such 
concepts as general intelligence, neurotic tendency, a prevailingly 
aesthetic outlook upon life, honesty, and mathematical ability are 
typical examples. Tests are constructed to try to measure each of 
them, and if we have no idea what our working concept is, and 
do not isolate it at all, how is it possible to build a test directed 
towards it? But there is not the slightest necessity to assume that
such a concept corresponds to an ultimate or “real” component of
the mind. It is only a guiding thread, a working hypothesis, set 
up for the sake of a job. Sometimes it seems to prove up. Some- 
times it does not. If our hypothesis yields an effective and work- 
able pattern of test items, and if the resulting test turns out to be 
useful and significant, then the concept about which it is organized 
is to that extent validated. If we really knew how the human 
mind is organized, there is no doubt that we could make much 
better tests, and perhaps they would look quite different from our
present instruments. But this knowledge is not available, and the 
best that can be hoped for is the emergence of good working 
hypotheses. 

Thomas and others have complained that the working concepts 
about which tests are set up are merely of the order of “common 
sense” ideas. Workers in the field of testing have taken over such
popular notions as that people have different degrees of intelligence,
or artistic ability, or assertiveness, or motor coordination,
tried to define them with at least a little more precision than is 
found in their ordinary use, and then proceeded to build tests 
around them. Surely, it is said, this means that psychometric instruments
involve very superficial psychological insights. This is
true enough, but nearly all the concepts of modern psychology are
pretty much on the level of common sense as Thomas and other
critics understand the term. Often those concepts may be dressed
up in impressive language, but this does not change their nature
or make them any more profound. The term “retroactive inhibition”
may startle the layman, but all it means is that if one learns one
thing and then proceeds to learn another, the second job of learning
may obliterate the first. Psychoanalytic concepts, of course,
seem to be of a different order, but they are at least open to
considerable question and have not been assimilated into general
psychological usage. The truth is that mental testing is neither
behind nor in front of the general development of the science of
psychology. It operates at just about the prevailing general level
of psychological investigation today. It is an important technique 
of demonstrated value. But those who devote themselves to it 
cannot be expected to produce operating concepts enormously in 
advance of our present understanding of mental life. 

In all this there is nothing that requires a mechanistic psy- 
chology or that need be unacceptable to adherents of an organismic 
psychology. Granted that the personality always acts as a unit, and
that it acts differently in different total situations, this does not 
absolve us from the task of analysis. We still need to try to under- 
stand what goes on, particularly if our interest happens to be prac- 
tical. The concepts of the Copernican astronomy are no doubt 
better than those of the Ptolemaic astronomy, but they certainly 
fail to do justice to the organic unity of the cosmos. They do not 
explain ultimate reality. They merely offer guide-lines for research 
and practical adjustment. So with mental testing at the present 
time. Its working concepts are much better than those of faculty 
psychology and phrenology, in which the practical outcome was 
the feeling of people’s bumps. If critics complain that mental tests 
do not as yet reveal the ultimate nature and functioning of the 
mind, one must admit that this is all too true. But it seems irrele- 
vant to the work in hand. 

The matter may be put as follows. The testing movement is, in 
effect, based on the proposition: Let us think of mental life in 
terms of such and such concepts. Let us build workable tests 
about them, and see what happens. What happens is partial suc- 
cess, and partial failure. What else could be expected, except per- 
haps complete failure? The basic concepts can be modified, refined, 
and reorganized. Their translation into ordered patterns of test 
items can be improved. These undertakings are, indeed, well under 
way, notably by dint of the statistical techniques of factor analysis. 
Thus it appears that a great variety of test responses can be
explained in terms of a relatively small number of “factors,” such 
as verbal ability, inductive reasoning, quantitative thinking, spa- 
tial thinking, and the like. And attempts are being made to con- 
struct tests which will reveal such factors, and nothing else, e.g., 
to reveal quantitative thinking in its purity, without any irrele- 
vancies. Such attempts at conceptual refinement and definition are 
exceedingly significant and have much promise. But they do in- 
volve the danger of mistaking working hypotheses for metaphysical 
claims, and of regarding such factors as self-existent psychological 
entities instead of guidelines for the improvement of testing. This, 
however, is no matter of logical necessity, and if it be avoided, 
then no organismic psychologist or progressive educator has the 
right to object to psychometric tests as theoretically untenable 
or intrinsically “mechanistic,” though he is abundantly right in 
hoping for better ones. 




SUGGESTED ADDITIONAL READINGS 


Ethel L. Cornell and Warren W. Coxe, A performance ability scale 
(Yonkers-on-Hudson, N. Y.: World Book Co., 1934), Chapter 1,
“Functions of a performance scale.” An excellent discussion of the
values and limitations of testing.

John C. Flanagan, “The selection and classification program for 
aviation cadets (aircrew—bombardiers, pilots, and navigators),” 
Journal of consulting psychology, 6 (1942), 229-39. Excellent account 
of a successful testing program. 

Lawrence G. Thomas, Mental tests as instruments of science (Evanston,
Ill.: American Psychological Association, 1942); also Psychological
monographs, vol. 54, no. 3, whole no. 245 (1942). Hostile
analysis of psychometric testing. Coordinates widespread criticisms. 

E. L. Thorndike, and Others, The measurement of intelligence (New 
York: Teachers College, Columbia University, Bureau of Publica- 
tions, 1927). To be scanned as source of numerous “atomistic” claims. 

Philip E. Vernon, The measurement of abilities (London: The 
University of London Press, Ltd., 1940), Chapter 10, “Hints to teach- 
ers.” Excellent practical advice which throws light on the nature of 
tests. 

L. W. Webb and Anna Markt Shotwell, Testing in the elementary
school (New York: Farrar and Rinehart, Inc., 1939), Chapter 7, 
“Uses of tests.” Good practical material. 


QUESTIONS FOR DISCUSSION 


1. From an examination of sample tests assemble as many different
kinds of test items as you can find. Does it seem to you that any of 
them might offset the limitations of testing discussed in this chapter? 

2. See if you can invent some test items which might work out in 
the measurement of general intelligence, mechanical aptitude, musi- 
cal ability, or any function you choose. This should give you an 
insight into the nature of tests and their construction and use. 

3. Make an outline of the argument presented by Thomas, particularly
in his fifth chapter. Document his criticisms of testing as
“atomistic” by references to the reading in Thorndike.

4. Does the position taken by Cornell and Coxe seem to any
degree to meet the criticisms summarized by Thomas?

5. Have you encountered in general reading, conversation, or elsewhere
any criticisms of testing not mentioned by Thomas? Can you
find any replies? 

6. Look over a number of sample tests. Do you think that their
titles might be in any way misleading? What questionable conclusions 
might they suggest? 

7. Is the fact that tests “work out” evidence in favor of the soundness
of their underlying theory?

8. What do you think of the use of memory items in intelligence 
tests? 

9. See if you can find out whether your own institution has any 
data on file regarding the value of psychological tests for entrance, 
guidance, and so forth. Collate and discuss. 

10. It is suggested that if possible one or more group psychological 
tests be administered in class, and the results used as a basis for dis- 
cussion in studying this and succeeding chapters. 




CHAPTER II 


PSYCHOLOGICAL TESTS AS INSTRUMENTS OF 
MEASUREMENT 


THE CONDITIONS OF MEASUREMENT


Any measuring device whatsoever must fulfill four definite con- 
ditions if it is to be of service. This is true of foot rules, balances, 
thermometers, speedometers, and also of psychometric tests. The 
four conditions must be met or the device will be misleading and 
useless. In regard to projective tests the issue is not so clear, for 
they are not considered to be devices for measurement, at least 
in the ordinary sense, and we shall find that there is some dif- 
ference of opinion regarding the criteria by which they must be 
judged. As to psychometric tests proper, however, the case is reasonably
clear. The values and limitations of such tests depend far
more upon these unavoidable requirements than upon any theo- 
retical orientation. 

The four conditions in question center on the same thing—the 
making of a measuring device which really measures, which yields 
authentic, dependable, serviceable results within the limits of its 
applicability and meaning. There are four classes of possible error 
which can afflict any instrument of measurement whatsoever. But
while it is necessary to consider them one by one, it is also important
to see that all of them are interrelated and that they affect
one another in many ways.

1. All measurement is subject to constant error. Suppose, for 
example, that the workers in a test laboratory wished to secure 
curves showing the fluctuations in the temperature of an engine
under operating conditions. They would install a device to take 
readings from the appropriate instrument, namely the thermom- 
eter or thermometers placed appropriately in contact with the 
machine. It might be, however, that a clumsy mechanic hooked 
up the recording mechanism to a thermometer not placed in contact
with the engine at all, but perhaps showing the temperature
of the outside air. When the readings came in for investigation, 
the curves would have a very peculiar look. They would have 
nothing at all to do with engine temperature, or at any rate only 
a very vague and undetermined relationship to it. And they would
be quite useless for purposes of investigation, or as a basis for any 
kind of practical decisions. 

In such a case the error is very wild and obvious. Yet errors 
just as wild, but unfortunately much less obvious, are entirely 
possible in the field of mental measurement. We might, for in- 
stance, have something called an intelligence test that called for 
nothing but routine memory responses, or for some very special 
knowledge or skill, or that was so easy that everybody scored 
100%, the only difference being the speed at which the subjects 
worked and the time they took to complete the assigned task. Or 
we might have something called a test of artistic talent in which 
the subject was called upon to show whether he could make a free- 
hand drawing of a straight line.* In such instances there would 
be falsification or error, due to a constant deflecting influence, 
and any set of scores that might be obtained would be invalid, 
which means that it would not represent the ability supposed to
be tested. 

So any measuring device must be valid. This is its first necessary 
characteristic. It must measure just what it purports to measure, 
and so far as possible nothing else. In other words, it must measure 
without undue constant error or deflection. 

2. All measurement is subject to variable errors. These are errors 
which come from accidents and inaccuracies, and they are due to
many causes. Every amateur carpenter who has ever tried to cut 
wood into proper shapes and sizes for a job of work is all too well 
aware of the possibility of variable error. He probably uses a valid 
instrument, namely a foot rule, but he may not apply it quite accu- 
rately, or he may read it wrongly. When he has found the right 
place on the board he is using, he may not draw his pencil line 
quite properly. And when he picks up his saw and proceeds to 
make his cut, there are plenty of chances for him to spoil his work.
Perhaps he tries to be very accurate and checks up on what he is 
doing. To that end, instead of making only one measurement, he 
repeats each of them three or four times. But the chances are that 
at each repetition he will come out at a different place on his board. 
In fact if extremely fine standards are applied, he is sure to do so, 
for each separate measurement has at least a modicum of unreliability.
The thing to hope for and to try for is measurement
accurate enough for the purpose in hand. It must be more accurate
for a living-room table than for a sawhorse, and more accurate
still if the job is the construction of an airplane engine. But it will
never be wholly and completely exact.

* This is not as impossible as it may seem. Just such tests of artistic ability
are actually given by not a few teachers, although the item is not used in any
published instrument known to the writer.

For all measurement is to some extent afflicted with variable 
error. Special instruments and devices are employed to reduce this 
variable error. Perhaps our amateur carpenter invests in a miter 
box, which costs him something, but enables him to feel a good 
deal more confidence when he starts cutting into his wood. If he 
had to get his work accurate to one thousandth or one ten thou- 
sandth of an inch, the miter box would not be much use, and he 
would have to invest in some very expensive instruments indeed. 
He would do so in the interest of accuracy or reliability, and to 
get away from the “personal equation” and variable errors gen- 
erally. 
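
The difference between constant and variable error can be shown with a few figures. The sketch below is an illustration only: the true length, the constant offset of 0.5, and the list of accidental deviations are hypothetical values chosen for the example and are not taken from any actual measurements. It shows that repeating and averaging readings removes variable error but leaves constant error untouched.

```python
from statistics import mean

# Hypothetical figures, chosen only for illustration.
TRUE_LENGTH = 30.0      # the length the carpenter is trying to mark off
CONSTANT_ERROR = 0.5    # e.g. a worn rule: every reading is off in the same direction
# Accidental deviations of five repeated readings (they sum to zero here).
VARIABLE_ERRORS = [0.2, -0.1, 0.0, -0.3, 0.2]


def readings(constant_error, variable_errors, true_value=TRUE_LENGTH):
    """Return one observed reading for each accidental deviation."""
    return [true_value + constant_error + e for e in variable_errors]


def average_error(observed, true_value=TRUE_LENGTH):
    """Error that remains after the repeated readings are averaged."""
    return mean(observed) - true_value


unbiased = readings(0.0, VARIABLE_ERRORS)            # variable error only
biased = readings(CONSTANT_ERROR, VARIABLE_ERRORS)   # constant plus variable error

# Averaging cancels the accidental deviations in both cases,
# but the constant error survives in the second.
print("remaining error, variable only:", round(average_error(unbiased), 6))
print("remaining error, with constant error:", round(average_error(biased), 6))
```

In these terms, repetition and better instruments address reliability, whereas only a check against the thing actually to be measured can expose a constant error, which is a matter of validity.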

The same problem is present in mental measurement. A certain 
test is given to a group of children, and they are ranked on the 
result. Then a second equivalent form of the test is given to them, 
and it is found that there is a certain amount of change in their 
rankings. Which set of results is the right one? There is no way 
of telling; for if the test has three forms and the third is adminis- 
tered, another slightly varying set of scores will appear. However, 
if the variation is only slight, it may not matter a great deal for 
practical purposes; but if it turns out to be extreme, then none of
the scores are usable, and the test itself is condemned as too
unreliable for service. So a psychometric instrument must have a
serviceable degree of reliability. This is its second characteristic.
It means that variable error, which can come from many sources, 
has been held down to a reasonable and workable degree. 
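
The amount of change in rankings between two equivalent forms can be expressed as a single figure. The following is a minimal sketch, assuming that agreement is summarized by a rank correlation (Spearman's formula for untied ranks); the scores for six children on three forms are hypothetical, and the function names are introduced here only for illustration.

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no tied scores."""
    order = sorted(scores, reverse=True)
    return [order.index(s) + 1 for s in scores]


def spearman_rho(form_x, form_y):
    """Spearman rank correlation between two sets of untied scores."""
    n = len(form_x)
    d_squared = sum(
        (rx - ry) ** 2 for rx, ry in zip(ranks(form_x), ranks(form_y))
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores of the same six children on three forms of a test.
form_a = [52, 47, 61, 38, 55, 44]
form_b = [50, 53, 63, 40, 51, 41]   # rankings shift only slightly
form_c = [40, 60, 45, 58, 42, 55]   # rankings shift wildly

print("form A vs form B:", round(spearman_rho(form_a, form_b), 3))
print("form A vs form C:", round(spearman_rho(form_a, form_c), 3))
```

A coefficient near 1 corresponds to the case in which the variation "may not matter a great deal for practical purposes"; a low or negative coefficient corresponds to the case in which none of the scores are usable.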

3. All measurement is subject to personal errors. These, of 
course, are a type or subclass of chance, or variable, or accidental
errors, but they are important enough to be considered by them- 
selves, at least in connection with psychological testing. 

To return to our amateur carpenter for an illustration, he may 
be on the job of cutting a fourth table leg to match the other 
three. He lays his rule on the wood and makes his markings. But 
he is tired, or shaky, or bored, or something distracts his attention, 
and the result is an error. An element of subjectivity has vitiated 
the measurement, due to the condition of the person who is making
it. Our carpenter can reduce the chances of such personal errors
by using guides and a power saw. Then he is very likely to get all
four table legs just about the same length. (They will not be abso- 
lutely so because of variations in the instrument, but they will be 
close enough.) Or if he lacks such equipment, he can get somebody 
else to check up on his measurements. In both cases he will secure 
a higher objectivity. 

In mental measurement the chances of personal error are very 
much greater than those with which the carpenter has to contend, 
for one thing because there is quite apt to be a strong element of 
bias. Thus teachers are apt to overestimate the intelligence of 
dull children, and to underestimate the intelligence of bright ones, 
because dull children are usually overage and large for their grade, 
whereas bright ones are usually underage and small. Also, of
course, personal feelings, liking and disliking, prejudice and
favoritism constantly disturb estimates. And when it comes to
assessing a person’s attitudes towards such matters as race or
war, or his neurotic tendencies, personal errors are always a major
threat. Thus good mental tests must be constructed to embody the 
same principle as the guides that help the carpenter. They must 
be devised to guard as far as possible against besetting personal 
errors. In other words, they must be made objective. This is the 
third of their basic characteristics. 

4. All measurement is subject to errors of interpretation. Con- 
sider, for example, a map of the world in Mercator projection. 
Unless one is careful, it can lead to most misleading notions, be- 
cause the units marked off by the lines of latitude and longitude 
change as one goes from the equator to the poles. Australia looks 
much smaller than the United States, yet really it is just about the 
same size. Greenland looks much farther from the North American 
mainland than it actually is. Air distances between remote points 
are distorted. 

The selfsame difficulty exists in connection with mental measurement.
And it exists in more serious and exaggerated form because
many people are not aware of it and because the precise formula 
for correction is not known, whereas in connection with the 
Mercator projection map it is known. Two tests are given to the 
same group of subjects. One of them yields scores of 50, 75, 100, 
and 132, among others. How should they be interpreted? Is 75
good, bad, or medium? Is the score of 100 double the score of 50, 
in terms of what it really means? The second test yields the score 
of 132 among others. Has this the same meaning as the score of 
132 on the first test? Questions of this kind constantly arise in the
application and use of mental tests, just as they would do if we
were making estimates of distance on a Mercator projection map
of the world and on a globe. 

Now the Mercator projection map is standardized on a known 
principle. That is to say, its units have a definite and ascertainable
meaning so that if any one is misled, it is entirely his own fault. So, 
too, there must be some way of assigning to the units of measure- 
ment set up in any psychological test a known and constant sig- 
nificance. They need not be equal, though it is more convenient if 
they are. That is to say, the difference between a score of 80 and
90 need not be exactly the same as that between a score of 120 and 
another of 130, or the difference between a mental age of eight 
and one of nine need not be equal to that between a mental age of 
twelve and one of thirteen. But the nature of the variation, if 
there is one, must be understood. This is just the situation with 
the Mercator projection map. Variation is less convenient and 
more risky than equality and is more apt to deceive the unwary. 
But so long as we know how to allow for it, there is no fatal objec- 
tion. Clearly, however, the meaning of the units of measurement 
set up in whatever instrument may be under consideration must 
be known, or it is unusable. That is to say, the instrument must be
standardized. This is the fourth of its necessary characteristics. 
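
One common way of giving a raw score a known meaning is to refer it to the scores of a norm group. The sketch below is offered only as an illustration of that idea, not as the procedure of any particular published test. The scores 50, 75, 100, and 132 are those mentioned above; the remaining scores in both norm groups are hypothetical, as are the function names.

```python
from statistics import mean, pstdev


def standard_score(raw, norm_scores):
    """Distance of a raw score from the norm-group mean, in standard deviations."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below the given raw score."""
    below = sum(1 for s in norm_scores if s < raw)
    return 100 * below / len(norm_scores)


# Norm groups for two different tests (mostly hypothetical figures).
test_one_norms = [50, 75, 100, 132, 90, 110, 85, 120]
test_two_norms = [132, 150, 170, 160, 145, 180, 155, 165]

# The same raw score of 132 means very different things on the two tests.
for name, norms in (("first test", test_one_norms), ("second test", test_two_norms)):
    print(
        name,
        "standard score:", round(standard_score(132, norms), 2),
        "percentile rank:", percentile_rank(132, norms),
    )
```

With these assumed figures a score of 132 stands at the top of the group on the first test and at the bottom on the second, which is precisely the kind of difference that an unstandardized score conceals.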

It is important to understand that the four necessary characteristics
of any measuring instrument are not independent, so that
validity, reliability, objectivity, and standardization are interlocking
aspects of the same thing, namely, the process of accurate and
serviceable measurement, which avoids gross errors of all kinds.
The balance of this chapter is devoted to considering in some 
detail the major technical problems involved in these four aspects 
of measurement. This furnishes a necessary background for the 
discussion of numerous tests and testing procedures presented in 
Chapters 3 to 8 inclusive. Then, after this body of concrete mate- 
rial has been set forth, we shall return once more to the topic, par- 
ticularly in the two final chapters of this book, and deal with some 
of the wider issues involved in the concepts of validity, reliability, 
objectivity, and standardization, and in the theory of measurement
generally. This movement from more specific, and procedural, and 
technical considerations to those that are more universal commends
itself as a means for gaining a grasp of the logic of measurement,
its implications, possibilities, and limitations. But as the
subject is approached, the first thing to see is that the above four
characteristics are interrelated, and that they are essential, since 
without them there could be no such thing as measurement. Essen- 
tially, the whole modern testing movement has turned on the 
discovery of ways of dealing with mental processes by instruments
reasonably valid, reliable, and objective, and fairly well
standardized. It is the fulfillment of these four conditions far more 
than any general theoretical basis which accounts for both the 
values and the limitations of mental tests. 


VALIDITY 


How validity is secured and how its degree and extent are 
determined can best be explained by reference to an actual 
instance, the construction of the well-known Henmon-Nelson Test 
of Mental Ability (v. Henmon and Nelson). 

The authors of the test started with a general working concep- 
tion of what they meant by “general mental ability,” as it mani- 
fests itself in pupils ranging from grades 7 to 12. The concept was 
not very explicitly defined, but as a matter of experience and prac- 
tice it was sufficiently explicit for the building of a test. Their 
first working step was to assemble test items which seemed rele- 
vant, including vocabulary, sentence completion, series completion, 
disarranged sentences, classification, anagrams, directions, proverb 
interpretation, and arithmetical reasoning. After a process of 
accumulation and selection they had got together a list of 250 such 
items in all. These were submitted to experienced teachers for 
criticism and judgment and 48 were eliminated, leaving 202 for 
further study. These again were set up as a test in two forms 
which were tried out on about 500 students. Those with the best 
predictive value—that is, those which gave the highest correlations 
with the entire battery and which discriminated best between 
students known to be high and low in ability—were retained, the 
total number now being reduced to 180. These were built into the 
two final forms of the completed test. This new test was then 
tried out with new groups of students, who were also given five 
other intelligence tests of recognized excellence. The new test 
correlated with one or other of these five tests from .72 to .88. 
This is a fairly representative procedure in the construction of a 
valid test, though there are many variations. 
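
The item-selection step described above, retaining items that correlate with the whole battery and that discriminate between students known to be high and low in ability, can be sketched in miniature. What follows is a hypothetical illustration and not a reconstruction of Henmon and Nelson's actual computations: the response table, the grouping of students, the cutoff of zero, and all function names are assumptions made for the example. Each response is scored 1 (right) or 0 (wrong), and the total used here includes the item itself.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def item_total_correlations(responses):
    """Correlation of each item with the total score on the whole set."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals) for i in range(n_items)]


def discrimination_index(responses, item, high_group, low_group):
    """Proportion passing the item in the high group minus that in the low group."""
    p_high = mean(responses[s][item] for s in high_group)
    p_low = mean(responses[s][item] for s in low_group)
    return p_high - p_low


# Hypothetical try-out: six students (rows) by four items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
HIGH_GROUP = [0, 1, 2]   # students assumed known, on other evidence, to be high in ability
LOW_GROUP = [3, 4, 5]    # students assumed known to be low in ability

r_values = item_total_correlations(responses)
d_values = [
    discrimination_index(responses, i, HIGH_GROUP, LOW_GROUP)
    for i in range(len(responses[0]))
]

# Retain only items that pass both checks (the cutoff of 0 is arbitrary).
retained = [i for i in range(len(r_values)) if r_values[i] > 0 and d_values[i] > 0]
print("retained items:", retained)
```

In practice far larger groups and stricter cutoffs would be needed; the point of the sketch is only that both criteria are computed item by item, and that an item failing either is dropped.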

There are three essential phases of the process. 

1. A working concept of the function or process to be tested
must be set up. As we have seen, Henmon and Nelson did not find 
it necessary to undertake a novel or elaborate specification of the 
nature of “general mental ability,” but a certain working idea was 
necessarily involved in all they did. In a case like this the idea 
would be implicit in what one might call the tradition of making 
general intelligence or general ability tests. Other authors, particu- 
larly when they are originating some departure from standard 
practice, have undertaken to define their working hypothesis quite 
carefully. Thus Binet considered general intelligence as turning on 
three characteristics—the power to take and maintain a definite 
direction in thinking, the power to make adaptations in order to 
attain a goal, and the power of self-criticism (Terman, 1916, p. 45).
Thorndike, in building his I. E. R. Intelligence Scale CAVD, thought
of intelligence as having four attributes. These were as follows: 
(a) Level, or altitude, i.e., the degree of difficulty that a person 
can reach in the performance of mental tasks; (b) range, i.e., the
number of different tasks one can perform at a given level of diffi- 
culty; (c) area, i.e., the number of different tasks one can perform 
at all levels of difficulty one is capable of reaching; (d) speed. 
He considered altitude by far the most important of the four, and 
proceeded to build his scale with reference to it. The scale in 
question consists of a series of graded tasks of four kinds (completion,
arithmetic, vocabulary, and directions), beginning at a
very easy level and continuing to a point which can be attained 
by very few human beings (v. Thorndike and Others, 1927). To
pass to another field, Voelker (q.v.) proposed to construct a test
to measure the trait of trustworthiness. Evidently he had to
define this trait in order to begin assembling relevant test items.
For his purposes he regarded trustworthiness as a tendency to
abide by instructions without supervision or checkup. These are
typical instances of the first phase of test construction; namely,
the isolation and definition of a concept which it is hoped will
prove effective.

or some other trait which will show up in his behavior. And the
first step is to clarify the concept and to define its meaning when 
it expresses itself in observable action. Without this, observa- 
tion lacks orientation and is fruitless (Olson and Cunning- 
ham). 

This phase of the process of making a valid test has been receiv- 
ing increasing emphasis in recent years, due to the wide use of the 
techniques of factor analysis. Factor analysis, as we have seen, is 
a statistical procedure (or array of procedures) designed to pro- 
duce tests which measure certain designated mental factors and 
nothing else. Thus the California Test of Mental Maturity is a 
battery intended to measure immediate memory, delayed memory, 
spatial thinking, logical reasoning, numerical reasoning, etc. In the 
manual the authors say, in effect, that its validity turns primarily 
on its capacity to measure precisely these defined and specific
factors. In other words, they rest their case, first and foremost, 
upon a very careful analysis and defining of the basic concepts 
underlying the instrument, this analysis being achieved not by 
mere conjecture, but by a closely controlled statistical technique. 
Validity as so interpreted has been called “factorial validity” 
(Guilford, 1946), i.e., the ability of a test to reveal some specific 
and designated component of mentality without any irrelevance. 
There is even a tendency at the present time to regard factorial 
validity as sufficient in and of itself, but this is open to very grave 
question. 

All this raises a fundamental question which has already been 
considered in a somewhat different connection. Do the traits or
factors we propose to observe or to test really exist? Do our working
concepts correspond to actual psychological entities? McCall
has stated the issue in a basic proposition which asserts that what- 
ever exists can be measured. By this he means that no matter how 
refined, or spiritual, or exalted, or evanescent a mental function 
may be, if it exists at all, it must exist in some quantity; and if it 
exists in some quantity, there is always, in theory at least, the 
possibility of ascertaining just what that quantity is. General 
intelligence and trustworthiness, which have already been men- 
tioned, would be two illustrations of what this proposition means. 
But it refers just as much to honesty, aesthetic appreciation, crea- 
tive power, unselfishness, sociability, spatial thinking, logical 
reasoning, and indeed to the whole vast range of functions of which 
the human mind is capable. 

One might, of course, object that it is one thing to admit the
theoretical measurability of any given trait or factor and a very
different thing to carry through the actual job of measuring it. 
But the relevant point here is somewhat different. We try to meas- 
ure or to observe many kinds of traits and functions and factors. 
But do they actually exist? Are they actual mental and behavioral 
entities? We see a child perform certain actions when he is with a
group of other children and say that he manifests friendliness. We 
find that he makes a certain score on a vocabulary test and claim 
that this indicates a certain degree of general intelligence. We 
find that he is able to manipulate various spatial designs in a test 
situation, and claim that this indicates a certain capacity for spa- 
tial understanding or thinking. So with all other such traits and 
functions. Must we, then, believe that these traits, and functions, 
and factors have a “real” or metaphysical existence? The assump- 
tion is certainly not necessary from the standpoint of test con- 
struction and validation. The only necessary hypothesis is that the 
behavior which observation records or the test score registers is 
indicative of how the individual will behave elsewhere and under 
other circumstances. When it is said that a child has proved him- 
self friendly in terms of certain observed actions, the implicit 
assumption is that he will prove himself friendly under other 
circumstances and in other types of action. When his high vocabu- 
lary score is taken as a sign of high general intelligence, the state- 
ment really means that he will deal intelligently with quite differ- 
ent problematic situations. When he succeeds in manipulating a 
set of geometrical designs, and thus scores well on “spatial rela- 
tionships,” this amounts to saying that he will be competent in 
handling spatial problems elsewhere. In so far as the working 
concepts which direct observation and testing are well chosen, 
and in so far as they are translated into relevant ways of behaving 
or relevant test items, this assumption is likely to prove true. This, 
however, does not necessarily mean that the concept corresponds 
to some real entity, but only that the sample of behavior that it 
isolates (the observed actions, or the test responses) is significant, 
i.e., symptomatic of the behavior of this individual in other cir- 
cumstances. 

2. The second phase in the building of a valid instrument of 
mental measurement is to assemble and select test items which in 
the experience and judgment of the maker are likely to involve the 
trait, or characteristic, or function as conceived. That is to say, the 
working concept must be translated into a series of specific 
responses elicited by properly chosen stimuli. 

This, of course, is a vital requirement, and it is not always any 
too well fulfilled. Thus many of the items in the Binet scale and 
its various revisions seem to correspond pretty well with the nature 
of intelligence as he conceived and described it. But others do not. 
The vocabulary test, for instance, which is always regarded as 
very important and valuable, would appear to depend a good deal 
more on past experience than upon the ability to follow a line of 
thinking, to adapt oneself for the sake of a goal, and to criticize 
one’s own endeavors. In the same way, such items as the repeti- 
tion of digits and of syllables certainly seem remote from intel- 
ligence as conceived and defined. Turning to the Thorndike I.E.R. 
Intelligence Scale CAVD, this is a test which stays very close 
indeed to its stated working conception, for it is expressly built to 
measure altitude of intellect by a graded series of tasks. With 
regard to Voelker’s attempt to construct a test of trustworthiness, 
which has also been referred to, he devised many interesting and 
ingenious items to ascertain whether a subject would be likely to 
abide by instructions without supervision. Some of them consisted 
of deliberate overstatements, such as asking a boy whether he had 
received a mark of 95 in arithmetic. Another consisted of requiring 
the subject to push a button every two minutes for a given period 
of time while a record was made to see if he obeyed. In yet an- 
other, the subject was given a page of arithmetic problems on the 
reverse side of which appeared what purported to be the answers. 
Some of these answers, however, were wrong, so that it was pos- 
sible to tell whether any copying was done. 

It is usual in the construction of a test to bring together a much 
larger pool of items than can be used. The maker’s own judgment 
and experience begin the work of refinement and selection. Then 
other capable and experienced persons are called in to carry the 
work further. In the case of the Henmon-Nelson test, the help of 
a group of experienced teachers was secured in deciding which 
items seemed most suitable. In the case of the I.E.R. Intelligence 
Scale CAVD and of the two Stanford Revisions of the Binet scale, 
the working staff who collaborated in the test construction consti- 
tuted in effect a jury. 

It is usually at about this point that technical statistical analysis 
begins to be introduced. The tentatively selected test items are 
tried out, often on quite large groups of persons similar to those 
for whom the completed test is intended. Items tend to be selected 
which give the highest possible correlation with total scores on 
the tentative test. The point of this is to be as certain as possible 
that the instrument will be internally consistent and that it will not 
contain material irrelevant to itself. Another criterion used is the 
discriminative power of the items. The trial subjects are rated by 
some external criterion, such as estimates by teachers, record of 
schoolwork, or other similar existing tests, on the trait or function 
to be measured. Items that lead to similar responses on the part of 
superior and inferior subjects are eliminated, and only those that 
show marked differentiation are retained. Thus in the construction 
and revision of the Binet scale, a criterion constantly used for the 
retention of subtests was that they should show improved scores 
when given to children in higher age groups. Yet another method 
coming prominently into use is factor analysis. Tryout tests are 
given. Then an analysis is made of the scores and their interrela- 
tionships which shows that they are explainable as due to the 
operation of a limited number of more or less well-defined mental 
factors. Then the items are rearranged and the tests reorganized 
so that each test in the battery measures one and only one such 
factor. 
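
The first two item statistics named in this paragraph, the correlation of each item with the total score and the discrimination of an item between superior and inferior subjects, can be sketched in a few lines of Python. What follows is only an illustration under stated assumptions: items are scored 1 (pass) or 0 (fail), the external criterion is a single numerical rating per subject, the superior and inferior groups are formed by splitting the subjects into halves on that rating, and the function names and the six trial records are hypothetical rather than taken from any test discussed here.

```python
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # an item that everyone passes (or fails) carries no information
    return sxy / (sxx * syy) ** 0.5


def item_total_correlations(responses):
    """Correlate each item with the total score on the tentative test.

    responses: one list of 0/1 item scores per subject.
    Note: the total here includes the item itself, which inflates the
    coefficient somewhat on very short tests.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]


def discrimination_index(responses, criterion, item):
    """Proportion passing in the superior half minus that in the inferior half."""
    order = sorted(range(len(responses)), key=lambda i: criterion[i])
    half = len(order) // 2
    inferior, superior = order[:half], order[-half:]
    p_superior = mean(responses[i][item] for i in superior)
    p_inferior = mean(responses[i][item] for i in inferior)
    return p_superior - p_inferior


# Hypothetical tryout data: six subjects, three items.
trial_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
teacher_ratings = [6, 5, 4, 3, 2, 1]  # hypothetical external criterion

for j, r in enumerate(item_total_correlations(trial_responses)):
    d = discrimination_index(trial_responses, teacher_ratings, j)
    print(f"item {j}: item-total r = {r:.2f}, discrimination = {d:.2f}")
```

In this made-up tryout the first item separates the two halves completely, while the other two barely do, so under the rule described above the first item would be the one most clearly retained.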


These are standard and accepted methods for getting a suitable 
set of test items for the measurement of some trait or function. 
However, the actual choice of items is much influenced by consid- 
erations that are not entirely explicit. Only those stimuli which will 
yield scorable responses can be used. It would be useless to try to 
test a person for such a trait as nervousness by showing him a 
horror motion picture and noting his reactions. Also, it would be 
useless to try to measure mental ability by presenting the subject 
with a complex mass of historical or scientific data and asking him 
to come back later with conclusions. In both these situations the 
response elicited might be admirably indicative, but it would be 

against facts. This certainly applies to the determining concepts 
around which tests are built. So validation against outside criteria 
is altogether necessary if we are to have tests on which confidence 
can be placed. Yet, as will appear, this external or “practical” 
validation is a good deal less decisive than might be expected. 

A. One frequently used criterion is that afforded by other tests 
of similar order. This may be considered a minimum requirement 
in proper test construction and evaluation. Henmon and Nelson, 
in comparing their completed test with five others, report correla- 
tions of .68 to .77 for groups of subjects consisting of college 
students, correlations of .77 to .88 for high school groups, and 
correlations of .54 to .90 for elementary school groups. Repre- 
sentative correlations between four well-known group tests are 
presented in Table 4. Quite often one of the revisions of the Binet 
scale is utilized as the chief external criterion for validation. 


TABLE 4 

CORRELATIONS OF FOUR GROUP TESTS WITH ONE ANOTHER AND WITH 
TWO OTHER CRITERIA, FOR FRESHMAN HIGH SCHOOL STUDENTS 

(Quoted from C. C. Ross, Table 3, p. 77) 

                             Detroit    Kuhlmann-   Terman   McCall
                             Advanced   Anderson    Group    Multi-Mental
Detroit Advanced ........      ..         .88        .86       .81
Kuhlmann-Anderson .......     .88          ..        .86       .77
Terman Group ............     .86         .86         ..       .78
McCall Multi-Mental .....     .81         .77        .78        ..
Average other 3 tests ...     .91         .88        .90       .83
Average Freshman Grades .     .46     [illegible]    .56       .42

As a validation procedure, this clearly leaves many open ques- 
tions. (a) It shows in general how well the new test conforms to 
accepted standards in the area concerned. But if this conformity 
is not very great, there is always the possibility that it may be due 
to the superiority of the new instrument. (b) Much of the agree- 
ment which is indicated may be due to a strong family resem- 
blance among the kinds of items used in tests of the same general 
category. If one inspects a number of tests of intelligence, per- 
sonality, interest, and attitude, it is not hard to find prima facie 
reasons why they will probably agree pretty well. (c) A fairly 
high degree of agreement no doubt indicates that the new test is 
based on pretty much the same working conception as those against 
which it is checked. But if it is desirable to improve, refine, and 
better delimit basic working conceptions, this may not be without 
its disadvantages. So in general, the procedure leads towards uni- 
formity and safety, but not towards the experimental betterment 
of instruments of mental measurement. 

B. Quite often a test is checked against other tests which pur- 
port to measure different functions. If there is a high level of 
agreement, this indicates that the difference is apparent and 
nominal rather than real. Thus it has been found that certain tests 
of interest show quite a close relationship with inclusive batteries 
of educational achievement tests. The conclusion is that in spite 
of the difference in name and in content, what is being measured by 
the two types of instruments is to a considerable extent the same. 
Then it becomes a matter of convenience and expediency which to 
use in a given situation. On the other hand, what is often found 
is marked disagreement. Tests of mechanical aptitude and talent 
tests, such as the Seashore Measures of Musical Talent, usually 
show very low although positive correlations with scores on tests 
of general intelligence. This, of course, means that the two instru- 
ments being compared are measuring widely different functions. 
It is, however, necessary to be very careful in putting forward 
interpretations of such findings. To say that musical talent or 
mechanical aptitude has nothing to do with intelligence would 
not be admissible, for the tests in question, like all others, are 
simply based on working concepts which may or may not corre- 
spond to psychological realities, and indeed probably correspond 
to them quite vaguely at the best. 

This validation procedure may be regarded as a practical win- 
nowing device. It enables us to group and subdivide tests as deal- 
ing with similar or dissimilar functions. Such knowledge has con- 
siderable importance in planning a testing program in which it is 
desired to get as wide-ranging a sample of mentality as can be 
had. But it can hardly be taken as proving anything about the 
organization of the mind, or the true relationship of human 
abilities. 

C. Very frequently tests are validated against achievement in 
school. Here the most widely used criterion is that afforded by 
school marks. Thus Jordan (q.v.) reported the following correla- 
tions of test scores against high school average marks: Army Intel- 
ligence Examination Alpha .38, Otis Self-Administering Test of 
Mental Ability .49, Terman Group Intelligence Test .47. Henmon 
and Nelson, in their validation procedure, found a correlation of 
.60 for their test against composite high school marks. The rela- 
tionship between the Medical Aptitude Test and marks in medical 
college has been reported in Table 3 (p. 20). 

As a validation criterion, school marks obviously leave a great 
deal to be desired. Even a composite mark has considerable unre- 
liability. And as an average it is made up of components usually 
unspecified, and each with a weighting which is not reported. The 
same composite mark would presumably mean two quite different 
things if it were made up of separate marks in shopwork, music, 
English, and typewriting, or of separate marks in algebra, geom- 
etry, trigonometry, Latin, and Greek. The reason why the criterion 
is so widely used is chiefly that it is about the only readily avail- 
able numerical rating to be obtained on large numbers of persons. 
Another and by no means negligible reason is that the widest use 
of tests is in educational guidance, and if they do not indicate and 
foretell school achievement expressed in terms of marks, their pur- 
pose will be defeated. 

A secondary criterion quite often employed is that furnished by 
teachers’ ratings. It has been frequently used in connection with 


tion of mentality than can be provided by estimates made by 
teachers, and to use the latter for proving up the former seems to 


of Mechanical Aptitude were checked against ratings by teachers 
of manual training and science. Many tests dealing with conduct, 
attitude, and personality have been checked against the opinions 
of teachers regarding the behavior of groups of subjects. There is 
a very good chance that such ratings have a real significance. The 
functions are reasonably definite and observable. The teachers are 
often specialists. They are likely to have fairly intimate contacts 
with their pupils. It is quite believable that their opinions about 
the mechanical competence and general behavior of these pupils 
are reasonably accurate. 

D. Various out-of-school criteria have been used for purposes 
of test validation. One quite frequently used is vocational success. 
In dealing with intelligence tests this criterion is not much em- 
ployed. It seems promising, but as a matter of fact it is decidedly 
treacherous. Just what is meant by success? Can it be registered 
in terms of salary? Does a salary of $5,000 indicate the same level 
of success in the Methodist ministry and in banking? How can 
success in different vocations be compared? Is success an indica- 
tion of genuine competence, and particularly of competence in 
some specified mental characteristic? Problems of this kind are at 
least very difficult if not insuperable, and the result has been that 
the criterion has proved almost unworkable for most types of tests 
(v. Bingham, pp. 229-35). One good instance of vocational valida- 
tion is that provided by the Strong Vocational Interest Blank for 
Men. It has been shown that those who succeed or continue in a 
given vocation are much more likely to show a strong interest in 
it than are persons in general. 

For tests of personality traits, and also for some intelligence 
tests, commitment to institutions for the feeble-minded or for the 
insane is used as a validation criterion. So also is later psychiatric 
diagnosis. These are probably about the best criteria we possess 
because they involve carefully considered expert opinion and 
responsible action. As will be pointed out later, the best personality 
tests measure up fairly well in this respect. This seems to 
indicate that they are built around well-chosen concepts and 
that these concepts are translated effectively into really indicative 
items. As to intelligence tests, the use of such criteria has shown 
that the two revisions of the Binet scale are inferior as clinical 
and diagnostic instruments to certain other individual scales, 
notably those of Wechsler and Kuhlmann. This, however, does not 
mean that the Stanford revisions may not be superior in other 
respects. 

E. A number of recent developments, notable among which 
was the wide use of mental tests in World War II, have called 
increasing attention to the problems and techniques of validation. 
The most important single question about any test must always 
be that of its validity, but when the selection of personnel for 
various types of war service was involved, the matter became very 
urgent. Psychologists were very definitely challenged to produce 
instruments of mental measurement which would definitely justify 
themselves and yield indubitable results (v. Staff, Personnel Re- 
search Section, Classification and Replacement Branch, the Ad- 
jutant General’s Office, 1943 c). J. G. Jenkins (1946 b, p. 93) has 
summarized the effect of this situation as follows: “The events of 
World War I taught American psychologists the necessity of 
validation. The experience of the next two decades taught them 
much about the technique of validation. It remained for World 
War II to drive home to psychologists the necessity for devoting 
much time and thought to the basis of validation.” In passing, it is 
interesting to remark that while mental tests were widely used in 
the German Air Force, the problem of their validity was largely 
ignored. Apparently German psychologists relied upon what is 
often called “face validity,” which is usually understood to mean 
prima facie or “common-sense” validation, without any careful or 
controlled checking or investigation (v. Mosier, 1947). 

The two most important considerations which have emerged 
largely as a result of war experience are, first, the need for a care- 
ful analysis of the criterion, and, second, the need to determine the 
degree of validity required if a test is to be usable. (a) First, it is 
recognized that the criterion must be analyzed. The general cri- 
terion of “success” can be extremely misleading. We propose, 
perhaps, to set up a test to select competent typists. If it is to be 
validated against success in typing, however, we must remember 
that one typist may be fast but inaccurate, another slow but 
accurate, etc., so that there are many kinds of “success.” The only 
way to meet this situation is to construct a test which will yield a 
profile of the various factors of success and failure in typewriting 
(Toops, 1944). Again, the criterion of success can easily be unre- 
liable. For instance, the number of hits scored by an aircraft 
gunner under training conditions may seem a very good indication 
on which to base a predictive test, but as a matter of fact this num- 
ber fluctuates enormously. Or again, it was found that ratings given 
to trainees on check flights by experienced instructors showed al- 
most no agreement at all. And in psychological experience in the war 
it very frequently appeared that a test which would predict success 
in training had very little relationship to actual field and combat 
performance (Jenkins, 1946). So it is that school grades, grades 
during training, ratings by instructors, and even output on the job 
are all dubious as criteria unless they have been very carefully 
analyzed (Stuit). (b) The second question that has become promi- 
nent has to do with the degree or level of validity that must be 
present. In place of the general statements that used to be accepted, 
it is now recognized that this depends largely upon the use to 
which the test is to be put. Ordinarily, it would be said that a 
correlation of .50 against an established criterion indicated a low 
validity, and raised doubts about the usability of the test, at least 
without qualifications and other indices. But if the problem were 
to pick out the top 30% of prospects for some job or assignment, 
a correlation of .50 with the criterion would yield 74% of good 
choices (Taylor and Russell). In other words, one cannot make 
general statements about the required validity of a test. One can 
only say that the more selective the decisions to be based upon it, 
the lower the correlations can safely be, and conversely.* 
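
The arithmetic behind the 74% figure can be approximated numerically. The sketch below is not the published Taylor and Russell procedure itself but a minimal reconstruction under assumptions that should be kept clearly in mind: test score and criterion are taken to follow a bivariate normal distribution, and the proportion of all prospects who would prove satisfactory without any selection (the base rate) is taken to be one half, a value the passage above does not state. The function name and parameters are hypothetical.

```python
from statistics import NormalDist


def success_ratio(validity, selection_ratio, base_rate=0.5, steps=4000):
    """Share of the selected group who turn out satisfactory.

    Assumptions (not given in the text): bivariate normal test and
    criterion scores, and a base rate of satisfactory prospects that
    defaults to one half.
    """
    if not -1.0 < validity < 1.0:
        raise ValueError("validity must lie strictly between -1 and 1")
    std = NormalDist()
    x_cut = std.inv_cdf(1.0 - selection_ratio)  # cutting score on the test
    y_cut = std.inv_cdf(1.0 - base_rate)        # standard of success on the criterion
    spread = (1.0 - validity ** 2) ** 0.5       # criterion scatter left after prediction
    upper = max(x_cut, 0.0) + 8.0               # far enough out that the tail is negligible
    width = (upper - x_cut) / steps
    joint = 0.0
    # Midpoint rule: sum over test scores above the cut of
    # (density of that score) x (chance of success given that score).
    for k in range(steps):
        x = x_cut + (k + 0.5) * width
        p_success = 1.0 - std.cdf((y_cut - validity * x) / spread)
        joint += std.pdf(x) * p_success * width
    return joint / selection_ratio


print(round(success_ratio(0.50, 0.30), 2))  # top 30% chosen with a validity of .50
print(round(success_ratio(0.50, 0.80), 2))  # same test, far less selective use
```

Under these assumptions the first call comes out at roughly .74, in line with the figure quoted above, while the second, less selective use of the same test yields a noticeably smaller proportion of good choices. A test of zero validity simply returns the base rate.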

Such, in schematic outline, are the methods used in establishing 
and determining the validity of psychological tests; that is to say, 
in reducing the constant errors of the instrument. Clearly, such 
procedures are all of them trial-and-error experimental processes 
leading to few absolutely clean-cut opinions. As a matter of fact, 
the ultimate validation of any test is to be found only in its wide 
and serviceable use. The basic conceptions are never perfectly 
clear. The test items are never perfectly relevant, nor can they 
reveal all the significant aspects of the function in question. The 
criterion is never wholly assured. This means that constant errors 
can never be wholly eliminated. As we have seen, the question has 
been raised whether tests, as we know them, can ever have any 
validity at all for the reason that mental traits and functions can- 
not be isolated. The answer clearly is that the process of analysis 
on which any test is necessarily built is far from perfect but is 
significant. For, taking the picture as a whole, our best tests have 
established at least a working and serviceable validity. And if 
everything in the science of psychology except what was susceptible 
of precise definition and indubitable proof were thrown out, how 
much would remain? 

* Taylor and Russell have published tables showing the precise relationship. 


RELIABILITY 


“If unchanging subjects are measured twice with a perfectly 
reliable instrument by a perfectly reliable agent, the correlation 
between the two sets of scores is 1.00” (Walker, p. 265). This 
is what is meant by reliability of 

ment yields the same result. Each single set of measures can be 
completely relied upon. Variable, i.e., accidental errors are elimi- 
nated. Such a condition is not attained even by the finest of 
physical instruments. Astronomical observatories, laboratories 
using the most advanced devices, factories manufacturing the most 
accurate appliances never achieve this final perfection. Variable 
error is always present to some degree, and allowance for it must 
always be made. This is much more conspicuously the case with 
psychometric instruments. But if there is to be measurement at 
all, the ideal must be approached to a reasonable degree. The 
tolerances must be fine enough for the practical purposes desired. 
Thus psychological tests are constructed to attain the best possible 
reliability, or at least a reliability sufficient for the ends to be 
served. 
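
The ideal of a correlation of 1.00 between two measurements of unchanging subjects, and the way variable error pulls the correlation below it, can be illustrated by a small simulation. This is a sketch under assumptions of its own: each obtained score is treated as a fixed true score plus an independent accidental error, both drawn from normal distributions, and the scale (mean 100, standard deviation 15), the sample size, and the function names are arbitrary choices made for the illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def retest_correlation(error_sd, n_subjects=2000, seed=0):
    """Correlate two administrations of the same test to the same subjects.

    Assumption: obtained score = unchanging true score + variable error,
    with the error drawn afresh and independently on each administration.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, 15.0) for _ in range(n_subjects)]
    first = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    second = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(first, second)


for sd in (0.0, 5.0, 15.0):
    print(f"variable error sd = {sd:4.1f}: retest r = {retest_correlation(sd):.2f}")
```

With no variable error the two sets of scores coincide and the correlation is 1.00; as the error grows relative to the spread of true scores, the correlation falls, and when the two are equal in size it settles near .50.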

In order to understand how this is done and what it involves, 
three points must be considered. They are first, the major causes 
of unreliability in mental testing and their avoidance; second, the 
methods used to ascertain the actual reliability of psychometric 
instruments; third, what can be accepted as sufficient reliability 
to make a test serviceable. As with validation, it will be found 
that the whole subject is less conclusive than is often sup- 
posed, or than cursory and elementary accounts frequently sug- 
gest. 


1. Causes of unreliability and their avoidance 


The causes of unreliability may be classified as those which are 
in the test itself, those which are in the person who takes it, and 
those which are in the person who gives it (Symonds, 1928; 
Walker, p. 258). 

A. First consider those causes of unreliability which are in the 
test. 

(a) Other things being equal, an increase in the length of a test 
will increase its reliability. The reason is that the addition of more 
and more items yields a better sampling of the subject’s true ability 
and gives him a better chance to show what he is able to do. A 
person may do unusually well or unusually badly with a very few 
items, but if there are a great many of them, such variable errors 
tend to cancel out. 

However, reliability does not increase in direct proportion to 
the increasing length of the test. For instance, if the test is made 
twice as long, its reliability is not doubled. The formula for com- 
puting the effect of increased length is as follows: 

$$ r_N = \frac{N r}{1 + (N - 1) r} $$

where $r_N$ is the new reliability coefficient, $r$ the original reliability 
coefficient, and $N$ the multiple by which the length is increased. 
So if the length is doubled and the original reliability is .70, the 
new reliability becomes: 

$$ r_N = \frac{2(.70)}{1 + (2 - 1)(.70)} = \frac{1.40}{1.70} \approx .82 $$

Or if the length is tripled, the new reliability becomes: 

$$ r_N = \frac{3(.70)}{1 + (3 - 1)(.70)} = \frac{2.10}{2.40} = .875 $$

Therefore, increase in length has a diminishing effect 
upon reliability. Clearly then, there is always a question as 
to whether an increase in the length of a test to secure a small 
gain in reliability is worth while. 

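
The relation between length and reliability described here is the one commonly known as the Spearman-Brown formula: the new reliability equals N times the original reliability, divided by one plus (N - 1) times the original reliability, where N is the multiple by which the test is lengthened. A minimal Python sketch follows. The function names are hypothetical, and the second function is simply the same formula solved for N, added to show how quickly the required length grows.

```python
def lengthened_reliability(r, n):
    """Reliability of a test made n times as long (Spearman-Brown formula).

    r: reliability of the original test, between 0 and 1.
    n: multiple by which the length is increased (n = 2 doubles the test).
    """
    return n * r / (1 + (n - 1) * r)


def length_needed(r, target):
    """Multiple of the original length needed to reach a target reliability.

    Obtained by solving the formula above for n; assumes 0 < r < 1
    and 0 < target < 1.
    """
    return target * (1 - r) / (r * (1 - target))


for n in (1, 2, 3, 4, 5):
    print(f"{n} x original length: reliability = {lengthened_reliability(0.70, n):.3f}")

print(f"length needed to raise .70 to .90: {length_needed(0.70, 0.90):.2f} x")
```

Each further doubling or tripling buys less than the one before it: starting from .70, the test must be made nearly four times as long to reach .90.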
(b) Irrelevant or disturbing elements in a test tend to lower its 
reliability. It is because reliability depends to an important degree 
upon the pertinence of the contents of a test, that is, upon their 
relevance to what is to be measured, that mere increase in length 
cannot be considered independently as a favorable factor, and 
must be thought of in relation to other matters as well. The follow- 
ing are among the most important causes of irrelevance or disturb- 
ance that occur in psychological tests. 

(i) Wide range of difficulty. Other things being equal, a narrow 
range of difficulty tends to increase reliability. When much of the 
content of a test is too easy and much of it is too difficult, many 
of the items do not evoke significant and revealing responses, and 
are thus irrelevant to the task of measurement for which it is 
intended. (ii) Scaling. Other things being equal, a test with items 
scaled in ascending order of difficulty will be more reliable than 
one with items in random order, or with the most difficult ones 
first. The reason is that items in ascending order of difficulty tend 
to evoke a more consistent and significant response, since the sub- 
ject is not disturbed by abrupt and irrelevant frustrations. Some 
of our best tests are constructed in this way, for instance, the 
I.E.R. Intelligence Scales CAVD. (iii) Item independence. Inter- 
dependent items, i.e., those which present the same problem in 
different forms, tend to lower reliability. The reason is that the 
form of the item tends to affect the response, and this is essen- 
tially an irrelevance. (iv) The incidence of chance. The use of 
items which involve important elements of chance tends to lower 
reliability. Thus the two-choice item, of which the familiar true- 
false item is an instance, and which involves a theoretical 50% 
chance, is the least reliable form item by item. Four- or five-choice 
items, in which the element of chance is greatly reduced, are more 
reliable item for item. For this reason it is standard practice to 
increase the total number of items where the two-choice type is 
used to compensate for item unreliability. It is best not to make 
correction for chance by right minus wrong scoring, to have at 
least fifty such items, and to combine them with items of other 
types. (v) Catch questions tend to lower reliability, because they 
introduce an obviously irrelevant element. (vi) Emotionally loaded 
items decrease reliability. For instance, if a test contains items on 
race problems, it may be less reliable in a community with strong 
racial prejudices than when they are absent. Again, in some tests, 
unpleasant, or disturbing, or ill-printed, or unintentionally funny 
pictures may be used, which clearly involve irrelevance and dis- 
turbance. 
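
The part played by chance in point (iv) can be made concrete with a little arithmetic. The sketch below assumes a subject who guesses blindly on every item, each choice being equally likely, and shows the score to be expected from guessing alone and its spread. It also shows, for reference only, the classical correction for guessing, rights minus wrongs divided by one less than the number of choices, which reduces to "right minus wrong" for two-choice items. The function names are hypothetical.

```python
def chance_level(n_items, n_choices):
    """Expected score, and its standard deviation, from blind guessing."""
    p = 1.0 / n_choices
    expected = n_items * p
    spread = (n_items * p * (1.0 - p)) ** 0.5
    return expected, spread


def corrected_score(rights, wrongs, n_choices):
    """Classical correction for guessing: R - W / (k - 1).

    For two-choice items this is simply rights minus wrongs.
    """
    return rights - wrongs / (n_choices - 1)


for k in (2, 4, 5):
    expected, spread = chance_level(50, k)
    print(f"50 items, {k} choices: chance score = {expected:.1f} (sd {spread:.2f})")

print(corrected_score(rights=30, wrongs=20, n_choices=2))
print(corrected_score(rights=30, wrongs=20, n_choices=5))
```

On fifty true-false items blind guessing alone is expected to yield half the possible score, whereas on five-choice items it yields only a fifth, which is why the two-choice form needs more items to reach the same dependability.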

The above may be considered the chief empirical elements of 
irrelevance that often lower test reliability. The techniques of 
factor analysis, however, go beyond these matters. Their purpose 
is to produce tests that are “factorially pure,” i.e., entirely and 
completely relevant to the concept about which they are built. It 
may be presumed that factorial purity will have an important 
effect in raising reliability. 

As far as the test itself is concerned, then, its length and its 
relevance are the two considerations that determine its reliability. 
How necessary it always is to consider both, and to work for a 
balance of influences is well shown by the Wonderlic Personnel 
Test in comparison with the Otis Self-Administering Test of Men- 
tal Ability, of which it is an adaptation. The Otis Test takes 
thirty minutes, and the Wonderlic Test but twelve, but the latter 
is approximately as reliable as the former, a presumptive cause 
being a high degree of relevance in its construction. 

B. The chief causes of unreliability which must be attributed 
to the person taking the test, i.e., the subject, are as follows. 

(a) The more common and familiar the subject finds the experi- 
ences which make up the content of the test, the greater the relia- 
bility is likely to be, other things being equal. It is often suggested 
that mental processes, and more particularly general intelligence, 
ought to be tested by means of items quite beyond the range of 
normal experience for the sake of avoiding the influence of the 
environment and for the isolation of hereditary factors. Such a 
test, however, would be inconceivable, and it is much better frankly 
to use items turning upon common experience on the assumption 
that they will reveal true differences. This consideration is of 
great importance when one proposes to administer a standard test 
to members of a racial group for whom it was not specifically 
planned. The unfamiliarity of the items may greatly lower the 
reliability of the test. To cite another instance, Army Intelligence 
Examination Alpha, developed and used during World War I, 
came under serious criticism when it was given to civilians, be- 
cause it drew rather heavily on special military experience and 
knowledge. 

(b) The mental set of the subject is always highly important. 
Any uncontrolled variations of mood and attitude at once lower 
the reliability of the measurement. Thus it is found that a test run 
late in the school year is apt to yield more reliable scores than it 
would if run soon after school opened in the fall, particularly with 
the younger pupils. The reason, of course, is that as time goes on 
the children achieve a balanced orientation and are able to accept 
the test situation without being disconcerted, or amused, or an- 
noyed, or otherwise disturbed. There is a special problem here in 
connection with tests given to preschool children and infants who 
are unused to such experiences. The scores may have too low a 
reliability for any use because the negativism and shyness of the 
subject disturb the testing. Willingness to cooperate and the 
avoidance of distractions are factors of major importance in the 
securing of reliable test scores. 

(c) Other things being equal, the use of practice material in a 
test before the subject comes to deal with the actual items is likely 
to increase reliability. But here again a balance of factors must be 
kept in mind. The American Council on Education Psychological 
Examination for College Freshmen has been adversely criticized 
because, in spite of its many excellent points, it comprises 19 
minutes of practice and 33 minutes of actual testing, which seems 
quite disproportionate. Some tests, for instance the Humm-Wads- 
worth Temperament Scale, actually include a great many “dead” 
items, i.e., items that are not scored although the subject responds 
to them just as he does to the others, partly for the sake of main- 
taining a good attitude. What effect this has on reliability is not 
known, so far as the writer is aware. 

C. Causes of unreliability in the agent, i.e., the person who gives 
or administers the test, are as follows. 

(a) Inaccurate or prejudiced scoring. Obviously if the opinions 
or feelings of fatigue or boredom of the agent enter in, the relia- 
bility of the test is decreased. Most tests are set up with con- 
siderable safeguards against such eventualities. More and more of 
them are appearing with scoring stencils or machine-scoring de- 
vices. But it should be noted that the excessive mechanization of 
tests, although very convenient and tending towards the avoidance 
of certain types of variable error, tends also to restrict the range 
of items. Test items are not notable for their flexibility at the 
best; but when they are set up so that the only response the sub- 
ject needs to make is to punch a hole in the proper box with a 
stylus, which is a common arrangement in machine-scored instru- 
ments, they are still further restricted. 

(b) If the agent or person who gives the test, because of lack 
of understanding, or skill, or proper care, gives insufficient or 
varying directions and instructions to those who are taking it, the 
effect is definitely to lower the reliability of scores. Most test 
manuals insist very strongly that the exact wording of the test 
instructions must be followed, the exact time allotment observed, 
and all details carefully handled. When it comes to individual 
tests, the importance of proper administration is greatly increased. 
To handle such a test properly requires very considerable thought, 
study, and instruction and not a little practical experience. 

(c) If the agent fails to establish effective rapport with the 
persons who are taking the test, the reliability of the obtained 
scores is decreased. The mood, the willingness, the seriousness, the 
cooperativeness, and the good will of the subjects are factors of 
primary importance. All this, again, is more prominent in indi- 
vidual than in group tests. But defective rapport can make the 
administration of the best group test a waste of time. 

It should by now be quite apparent that the problem of relia- 
bility is not simple. Variable errors can come from many causes 
and they cannot be entirely eliminated. Also, a balance of factors 
is always involved, for it is quite possible to boost reliability at 
the sacrifice of validity or significance, as for instance by greatly 
restricting the range of difficulty, or by reducing all items to a 
dead level of uniformity for the sake of accurate and quick scoring. 


2. How the reliability of a test is ascertained and recorded 

As a common-sense proposition there is only one way of ascer- 
taining the reliability of any obtained measurement, and that is 
to repeat it. This, in fact, is what any worker will do—a carpenter, 
for instance. If he finds that his two or three measurements of the 
same piece of board coincide closely, he will conclude that he can 
rely on any one of them sufficiently for practical purposes. If he 
happened to be interested in the theory of measurement, he would 
apply his rule many times, keep a record of the results, and see 
how closely they agreed. This would enable him to calculate the 
variability or amount of error involved. If he wanted to ascertain 
the reliability with which he was able to use his rule, he might 
make two measurements each of perhaps twenty different jobs. 
Then he would have two groups of twenty measurements, the first 
and the second try, and he could work out their correlations. This 
would be a type of coefficient of reliability. 

Such is the procedure commonly used in mental measurement. 
The same instrument is applied twice to the same group. If the 
group contains twenty persons, then there are two groups of twenty 
scores each. The correlation between these two groups of scores 
is usually known as the coefficient of reliability. 
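
The computation just described can be sketched in a few lines of Python. This is an illustration only: the twenty pairs of scores are invented, the name `pearson_r` is hypothetical, and the product-moment formula is assumed as the method of correlation, since the text does not name one.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, at least 2 scores each")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores of twenty persons on two givings of the same test.
first_try = [52, 47, 61, 38, 55, 44, 66, 50, 41, 58,
             49, 63, 36, 57, 45, 60, 53, 40, 48, 62]
second_try = [54, 45, 63, 41, 53, 46, 64, 52, 40, 60,
              47, 65, 39, 55, 48, 58, 55, 38, 50, 61]

retest_coefficient = pearson_r(first_try, second_try)
print(round(retest_coefficient, 2))
```

The high value this yields reflects nothing but the invented data, in which the second scores were made to lie within a few points of the first.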

There is, however, a certain persistent difficulty in connection 
with mental measurement that does not affect physical measure- 
ment nearly so much. The carpenter can reasonably assume that 
no change has taken place in his boards or posts in the interval 
between the two applications of his rule. Moreover, the act of 
measurement itself does not affect the object being measured. 
Neither of these assumptions is equally safe when one is dealing 
with psychological phenomena. Many changes may actually take 
place in a group of human beings between two sessions devoted 
to testing. Even if the time interval is short, there are such factors 
as fatigue, alteration of mood, and so forth, to be considered. And 
if it is long, then mental growth and specific learning can easily 
affect the situation. Also, a group of human subjects will learn 
something from one experience with a test. So, although they may 
superficially seem the same at the time of the second testing, they 
may in reality have changed very considerably, and will certainly 
have changed to some extent. All this greatly complicates the 
problem of ascertaining the true reliability of any test. The meth- 
ods most generally used for dealing with the matter are the 
following. 


A. One may repeat the same test with the same subjects. Then 
the two sets of scores are correlated. The obtained coefficient is 
often called a reliability coefficient, but it is in reality a special 
class of reliability coefficient and should more accurately be termed 
a retest coefficient. This is a very frequent practice, but it clearly 
raises some serious questions. In a short time interval there is very 
likely to be some specific practice effect carried over from the first 
to the second testing. Also, there is very likely to be some general 
orientation to the test, its items, its timing, its setup, and so forth. 
If the time interval between testings is long, the obtained correla- 
tion will probably reflect the effect of growth, of learning, and of 
environmental influences generally quite as much as it does the 
reliability of the instrument. The trouble in general is that even 
though the test itself is well constructed and avoids the more 
flagrant causes of variable error, the subjects to whom it is applied 
do not remain the same (Cronbach). 

B. Another procedure is to administer two forms of the same 
test to the same group of subjects, and to express the relationship 
between the scores in a correlation coefficient. Here we have an- 
other type of reliability coefficient, sometimes called the interform 
coefficient. When a well-constructed test comes in two or more 
forms, they are equated for difficulty, equality being established 
by the scores of a selected group or groups. Clearly the interform 
procedure avoids most of the objections involved in the retest 
procedure. Yet serious difficulties and reservations remain. The 
equivalence of the forms may be, and indeed probably is, super- 
ficial and imperfect (Cronbach). It is quite possible that two 
forms may be of equal difficulty for one group, but not for another. 
If so, the correlations will reflect this difference, and will be 
erroneous estimates of the reliability of the instrument. Moreover, 
the experience of having taken one form is almost sure to carry 
over to some extent to the taking of another; and this still further 
undermines the assumed equivalence. 

C. It will be noticed that both procedures just described work 
in terms of actual assumed equivalence. The second giving of the 
test, or the second form of the test is assumed to be equivalent to 
the first; and whether or not the assumption is justified, the pre- 
sumptive equivalent test is actually given. Another approach to 
the reliability problem turns upon administering only a single 
testing, and proceeds in terms not of actual, but of hypothetical 
or “rational” equivalence, to use the expression of Kuder and 
Richardson (q.v.). 

The most familiar method of doing this is to run the test once, 
and to split it into two halves. These two halves can be considered 
as two tests taken at one sitting, and a coefficient of correlation 
calculated between the two sets of obtained scores. But each of 
these halves is only half as long as the original. So the obtained 
coefficient must be boosted by the formula given on page 47, 
which is known as the Spearman-Brown Prophecy Formula. Thus, 
if a test containing, in all, 150 items is treated as two tests of 75 
items each, and if the correlation of obtained scores on the two 
halves is .60, the formula would yield a self-correlation for the 
whole test as follows: 

$$r_{11} = \frac{2r}{1 + r} = \frac{2(.60)}{1 + .60} = .75$$

where r is the obtained correlation of .60, and $r_{11}$ the derived coefficient. 
Notice that this differs in principle from the retest and 
interform procedures, because two supposedly equivalent forms 
are not given, the hypothesis being that if two genuinely equiva- 
lent forms were given they would correlate as indicated. 
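
As a check on the arithmetic, the stepping-up of a half-test correlation can be written as a small Python function. The sketch assumes the usual general form of the Spearman-Brown formula, k·r / (1 + (k − 1)·r), with k = 2 for a test twice as long as each half; the function and argument names are hypothetical.

```python
def spearman_brown(r_half, factor=2.0):
    """Estimate the self-correlation of a test lengthened by `factor`.

    r_half is the correlation obtained between the two shorter parts;
    factor=2 gives the familiar split-half correction.
    """
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


# The example in the text: two halves of 75 items correlating .60.
whole_test = spearman_brown(0.60)
print(round(whole_test, 2))  # 0.75
```

The same function shows why the correction matters more for low coefficients than for high ones: a half-test correlation of .90 is raised only to about .95.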

This avoids the difficulties considered in connection with meth- 
ods A and B. But it involves a serious objection of another kind. 
For there are a great many different ways of dividing a test into 
two halves (Kelley, 1942; Kuder and Richardson, 1937). And 
the obtained coefficient will depend upon the way the test is 
divided. Split-half reliability coefficients are very often reported, 
but all of them are infected with this uncertainty. 

To meet this problem Kuder and Richardson (1937, 1939) have 
proposed a technique for deriving a reliability coefficient from 
one giving of a test, without splitting it into two parts. They 
present 22 formulae for doing this, among which they specially 
recommend No. 20. This formula requires nothing but the vari- 
ance of the test,* the number of items, the percentage of right 
answers, and the percentage of wrong answers. It is as follows: 

$$r_{11} = \frac{n}{n - 1} \cdot \frac{\sigma_t^2 - \sum pq}{\sigma_t^2}$$



* The variance is the average of the squares of the deviations about the central 
tendency, i.e., the square of the standard deviation. For an account of its use in 
the statistics of psychological research v. Lindquist, also Hoyt. 
where $\sum pq$ means the sum of the products of p and q for all test 
items, p the proportion (per cent expressed as a decimal) of right 
answers given on a test item, q the corresponding proportion of 
wrong answers, n the total number of items, and $\sigma_t^2$ the 
variance of the test. This technique, 
which avoids the bias involved in splitting a test, is coming into 
wide use. 
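
A minimal sketch of the computation in Python may make the ingredients of the formula concrete. The data are invented, the function name `kr20` is hypothetical, and the variance is taken as the average squared deviation (dividing by the number of subjects), following the footnote's definition; other writers divide by one less than the number of subjects.

```python
def kr20(item_matrix):
    """Kuder-Richardson formula 20 from one giving of a test.

    item_matrix holds one row per subject; each entry is 1 for a right
    answer and 0 for a wrong answer to the corresponding item.
    """
    n_subjects = len(item_matrix)
    n_items = len(item_matrix[0])
    if n_items < 2:
        raise ValueError("the formula needs at least two items")

    # Variance of the total scores (average squared deviation).
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_subjects
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    if var_total == 0:
        raise ValueError("all subjects made the same score; variance is zero")

    # Sum over items of p (proportion right) times q (proportion wrong).
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_subjects
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total


# Invented responses of five subjects to four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.8
```

In this invented case the total scores are 4, 3, 2, 1, 0, so the variance is 2; the item proportions are .8, .6, .4, .2, so the sum of pq is .8; and the coefficient is (4/3)(1.2/2) = .80.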

It should be emphasized once again that the coefficients ob- 
tained by both the Spearman-Brown and Kuder-Richardson 
methods differ in principle from retest or interform coefficients. 
The latter are obtained by actual correlations of two supposedly 
equivalent testings. The former involve only a rational equiva- 
lence, and express the internal consistency of the test (Kelley, 
1942; Jackson and Ferguson). 

D. In all the above instances, the reliability of a test has been 
expressed (or, more precisely, estimated) by means of a coeffi- 
cient of correlation. This expresses the correspondence or rela- 
tionship between the whole sets of scores obtained by two givings, 
by giving two forms, or on the assumption of rational equivalence. 
But it is often convenient to know just how much error is in- 
volved in any one given test score. This can be expressed by 
means of the standard error of estimate. The formula is as 
follows: 


$$SE = \sigma\sqrt{1 - r^2}$$ *


SE is a symbol frequently but not always used for standard 
error; σ is the standard deviation; r is the obtained reliability 
coefficient. The practice of reporting reliability both in terms of a 
coefficient and of a standard error is quite common. Its advan- 
tage can be seen from the following hypothetical instance. 
Let us suppose we have a retest coefficient of .85 and a stand- 
ard deviation of 18. Then the standard error will be about 9.48, i.e., 
18 × √(1 − (.85)²). This tells us that there is a 68% chance that 
a person making any given score on the first testing will score 
within a range of + 9.48 to − 9.48 of that score on the second 
testing. That is, if a person makes a score of 170 on one testing, 
there is a 68% chance that his score on another testing (which is 
often called his “true score,” i.e., his score on any other testing) 
will fall between about 161 and 179. 
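
The hypothetical instance can be verified with a short Python sketch. It simply evaluates the formula as given above, with the standard deviation of 18, the coefficient of .85, and the obtained score of 170 taken from the text; the function name is hypothetical. Evaluated without intermediate rounding, the product comes to about 9.48.

```python
from math import sqrt


def standard_error(sd, r):
    """Standard error from the standard deviation and the reliability
    coefficient, using the formula SE = sd * sqrt(1 - r**2)."""
    return sd * sqrt(1.0 - r ** 2)


se = standard_error(18, 0.85)
obtained_score = 170
low, high = obtained_score - se, obtained_score + se

print(round(se, 2))            # about 9.48
print(round(low), round(high)) # limits of the 68% range: 161 179
```

The 68% figure rests on the assumption that errors are normally distributed, so that about two thirds of cases fall within one standard error on either side.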


* There are various qualifications and assumptions here that are not mentioned 
in the present brief exposition (v. Walker). 
From all this it is quite clear that the reliability data reported 
in test manuals and elsewhere cannot be taken at face value, but 
must always be critically scrutinized. Consider, for instance, the 
reliability coefficients shown in Table 5. In examining them a 
number of comments are in order. 


TABLE 5 


REPORTED RELIABILITY COEFFICIENTS FOR SEASHORE TEST OF PITCH 
DISCRIMINATION 


(From Mursell, Fig. 23, p. 293) 


Subjects                      Method         Coefficient of reliability 

157 music students            retesting      .54 ± .04 
132 adults (?)                retesting      .64 ± .03 
music students                retesting      .90 ± .02 
normal school students        retesting      .73 ± .05 
college students              retesting      .68 ± .04 
White college students        retesting      .88 ± .02 
Negro college students        retesting      [illegible] ± .02 
White college students        retesting      .69 ± .03 
colored college students      [illegible]    .58 ± .02 
93 high school students       retesting      [illegible] ± .03 
59 music students             retesting      .76 ± .03 
157 music students            split half     .60 ± .03 
200 music students            split half     .51 ± .03 
College students              split half     .84 ± .02 
142-328 college students      split half     .74 ± .02 (?) 
[illegible]                   [illegible]    .75 ± .03 (?) 
200 eighth graders            retesting      .83 ± .01 
100 eighth graders            retesting      .85 ± .02 
Seventh graders               retesting      .90 ± .01 
Sixth graders                 retesting      .82 ± .07 (?) 
285 preadolescents            split half     [illegible] 
208 adolescents               split half     .40 ± .04 


(a) Some of them are retest coefficients. Yet the time interval 
between test and retest is not stated in the table and often does 
not appear in the full and original research reports from which 
the data are taken. In the case of this particular test, the time 
interval may not be of prime importance, although one cannot be 
sure that it is not. Very often it is. 

(b) The wide variation among these obtained coefficients is 
instructive to any student of testing. Such variation is commonly 
found in all similar reports. It is no doubt largely due to variable 
errors of one sort or another, which, as shown, can originate from 
a large number of causes. The actual test in this case is very 
competently constructed. It consists of 100 pairs of tones differing 
in pitch, beginning with fairly large differences and going to very 
small ones. But its efficacy depends upon the newness of the 
phonograph record and also on the reproducing device and the 
acoustics of the room. Also, many people find a first adjustment 
to it difficult and annoying. Variable errors due to the instrument, 
then, cannot be ruled out. And errors originating in the agent and 
in the subjects are not only very probable but indeed certain to 
occur. 

(c) Another and purely statistical consideration must also be 
borne in mind. The greater the spread of ability in a tested group, 
the higher the obtained correlations will be, other things being 
equal. It is almost certain that these numerous groups who were 
used to calculate coefficients of reliability differed considerably in 
the spread or range of their capacity to discriminate differences 
in pitch. This in itself would result in different obtained coeffi- 
cients, even if all variable errors were avoided. 
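
The statistical point can be illustrated under a simplifying assumption that the text does not state: that each obtained score is a true score plus an independent error, so that the reliability coefficient equals the true-score variance divided by the total variance. On that assumption, the following Python sketch holds the error constant and varies only the spread of ability in the group; the function name and the figures are hypothetical.

```python
def reliability_from_spread(true_sd, error_sd):
    """Reliability under the assumed model: obtained = true + error,
    with error independent of true score.

    Returns true-score variance divided by total variance.
    """
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)


# Same instrument and same error (SD of 6) applied to three groups
# that differ only in their spread of ability.
for true_sd in (4.0, 8.0, 16.0):
    r = reliability_from_spread(true_sd, error_sd=6.0)
    print(true_sd, round(r, 2))
```

With the error unchanged, the coefficient rises from about .31 for the narrowest group to about .88 for the widest, which is the effect described in (c).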

A very pertinent question then arises. Which of these obtained 
correlations expresses “the” reliability of the Seashore Test of 
Pitch Discrimination? The answer is: None of them! There is no 
such thing as “the” reliability coefficient of any test. There are 
only reliability coefficients obtained with different groups under 
different circumstances and registering more or less successful 
attempts to avoid variable errors of various kinds and coming 
from various causes. When a test manual reports a certain coeffi- 
cient, this actually means that the stated coefficient has been 
obtained in work with an experimental group presumably con- 
ducted with the greatest possible care. It does not mean that the 
instrument inevitably or automatically yields any such result. It 
only indicates that the authors have proved that their test can 
yield so much reliability under the circumstances created by their 
own administration of it. And in any case its meaning depends 
on its mode of derivation. 




3. How much reliability must a test possess? 


There is no one simple or standard answer to this question. 
How much reliability must physical measurements display? If a 
carpenter is making a sawhorse, considerable error is allowable. 
If a cabinetmaker is constructing a fine table, the margin of error 
must be much smaller. If a factory is manufacturing automobile 
engines, accuracy must again be greater. If airplane motors are 
being made, it must be greater still. And if a hundred-inch reflector 
for an astronomical telescope is being fabricated, then the utmost 
accuracy and the highest reliability that human ingenuity can 
compass is barely sufficient. So in physical measurements, the 
allowable degree of tolerance depends on the purpose involved. 
This is also the case with mental measurements. 

It is often said that if a reliability coefficient is calculated for a 
group with the range of one school grade it must be at least .50 
in order to discriminate between two group means with sufficient 
certainty so that there is a five-to-one chance of the difference’s 
being real. For individual classification, however, a test is said to 
require a reliability of at least .94 when calculated under the same 
conditions. These statements were made by Kelley more than 
twenty years ago (q.v., 1927, pp. 210-11), and have been quoted 
many times since. One formidable implication is that few pub- 
lished tests are adequate for individual diagnosis, and that they 
can at best be properly used only for rough group differentiation, 
for few of them have reliabilities of .94 and over, computed on 
a range of one school grade. 

The reply is, however, that while the above statements are 
beyond question correct on the assumptions upon which they were 
based, they do not take into consideration the actual uses to 
which tests may be put (Guilford, 1946). We have already seen 
that if a test is to be used for very exacting selection, e.g., to 
screen out all but ten, or twenty, or thirty per cent of prospects, 
it can have a relatively low validity and reliability and still be 
serviceable and dependable. Also a test may make a unique con- 
tribution, and so belong in a battery in spite of low reliability and 
validity. Thus a short test requiring judgments about the lengths 
of lines was set up in connection with airplane pilot selection 
during the war. It had a reliability of only .25, but because it 
brought out a unique factor, it belonged as a proper element in 
the selection procedures. 
Kuhlmann (1939) has raised a question of fundamental sig- 
nificance in this whole connection. He contends that tests ought 
not to be constructed in such a way as to yield uniformly high 
reliabilities. His argument is that there are many variable con- 
ditions which are in fact important and relevant and that a sensi- 
tive instrument ought to register them. For example, he says a 
headache actually though temporarily affects mental ability, and 
a test so constructed and administered that it overrides and 
ignores the influence of the headache is false to the real facts of 
the situation. Many of the factors which we have called variable 
errors, particularly those originating in the subject, Kuhlmann 
would regard as perfectly legitimate variations in the facts them- 
selves. Carried to an extreme, this argument would render all 
testing, and indeed all psychological experimentation, futile because 
it would be overwhelmed by uncontrolled variable influences of 
many kinds. What Kuhlmann really seems to have in mind is that 
a test should be considered as a standard situation which a clini- 
cian or guidance counselor uses to help him in observing a sub- 
ject’s reactions. The score itself would presumably be of much 
less importance than the conclusions which the expert would draw 
in studying the performance of the person to whom he was admin- 
istering the test. To understand this position, it must be remem- 
bered that Kulhmann has advanced it in connection with his Tests 
of Mental Development, which is an individual instrument. The 
idea could hardly apply to group testing. It must presumably be 
interpreted as a claim that an individual test should be used 
primarily as an instrument of diagnosis rather than of measure- 
ment—perhaps as primarily projective and only in a secondary 
sense psychometric. 

From all this it is quite apparent that the issue of test relia- 
bility is far less conclusive than many people have been led to 
suppose. The upshot clearly is caution. Psychological tests are very 
far indeed from being automatically usable instruments with a 
known reliability which invariably appears in the scores obtained. 
No one set of test data can be considered final. So-called snapshot 
testing, in which instruments of measurement are applied once to 
groups of subjects and all sorts of long-range conclusions con- 
fidently drawn, should certainly be avoided. The total personal 
situation of the subjects must always be taken into account. But 
when all this is said, the fact remains that our existing psycho- 
metric tests, in spite of their limitations and uncertainties, have 
proved exceedingly valuable and constructive instruments, and 
that their value can be enhanced by more discriminating and well- 
instructed use and interpretation. 


OBJECTIVITY 


By the objectivity of a measuring instrument is meant its free- 
dom from errors due to personal feeling and bias, its ability to 
reflect the facts of the situation irrespective of the “personal equa- 
tion” of the agent. The usual way of explaining this characteristic 
is to say that a test is objective to the degree that a number of 
different persons applying it to the same phenomena and scoring 
it will come out with results which agree. 

Most psychometric tests achieve a high nominal objectivity by 
quite simple means. The items of which they are constructed 
permit only a single “right” response. These responses are scored 
by a key, or a stencil, or a machine. Sheer mistakes are of course 
always possible, but personal opinion and bias are eliminated. In 
some individual tests, to be sure, the judgment of the agent is 
involved. Thus the subject’s response to some of the Stanford- 
Binet items, e.g., the vocabulary items, can be quite varied, and 
the person who is giving the test has to decide how “good” they 
are. Even so, the manual contains a wealth of very specific instruc- 
tions and limits and particularizes judgment, keeping it within a 
narrow channel. 

A high degree of objectivity is usually regarded as very desir- 
able and as relatively easy to achieve, but there are quite a num- 
ber of qualifications to consider before accepting this view out of 
hand. 

1. When high objectivity is obtained simply by using items with 
only one allowable and specified response, what is gained in one 
way may easily be lost in another. For instance, in a test which 
was given to a group of Indian children, the following item oc- 
curred: “Crowd—closeness, danger, dust, excitement, number.” 
Two of the last five words were to be underlined, on the basis of 
their congruence with the stimulus word crowd. The tendency of 
the Indian children was to underline “dust,” “excitement,” and 
“danger.” In view of their special background and experience these 
were perfectly intelligent responses. But they did not agree with 
the scoring key of the test (Fitzgerald and Ludeman). Or again, 
in the Goodenough “Draw a Man” test of intelligence, the child 
is rated on the lines he puts in to represent the various parts of 
the human anatomy and similar representational factors. But a 
worker using such responses as projective indications might feel 
that they suggested all sorts of interpretations as to the types of 
personality, the emotional organization, and the temperaments of 
the subjects and might object to limiting them as the test requires. 
In dealing with true-false items again, only the standard “correct” 
responses receive credit. But if one could know the mental proc- 
esses by which unacceptable answers were arrived at, they might 
be considered indications of high intelligence and judiciousness. 
To put the matter in general terms, the objectivity actually mani- 
fested by a great many tests is arbitrary rather than real and to 
that extent spurious. It is something forced upon the subject’s 
responses by the mechanism of the scoring. 


[...] dealing with objectivity 
(H. F. Adams). 


3. This leads directly to another still [...] 
mental issue. A test may be [...] 
the presence of bias or subjectivity, or of an arbitrary element in 
the test. What this element is seems easy enough to indicate. The 
basic concept around which the test is built is never perfectly 
clarified, nor is it translated into test items with perfect con- 
sistency. The test rates fairly well on what its author considers 
intelligence and on the items into which he has managed to trans- 
late his operating conception, but it rates less well on a common, 
or objective, opinion as to the nature of intelligence. The same is 
even more true of aptitude tests. Thus the Drake Test of Musical 
Memory and the Seashore Measures of Musical Talent both pur- 
port to yield scores indicating musical ability. But the two authors 
conceive of the function very differently. The same could easily 
be shown of different tests of mechanical aptitude. If we propose 
to measure the width of a door, we are not dealing with any one 
individual’s special view about the nature of width, but with a 
universal, i.e., an objective consensus. In tests of mental processes, 
however, we are invariably dealing with some individual concep- 
tion of what is to be measured. And although that opinion may be 
well informed, or well in line with commonly accepted views, or 
backed up by reasonably convincing evidence, it is sure to have 
an element of subjectivity. 

Thus the problem of objectivity, like those of validity and 
reliability, is not one that has been conclusively settled. No tests 
even approximate to perfect validity, reliability, and objectivity in 
all senses. Still, their practical value indicates a theoretical authen- 
ticity that goes about as far as it can in the present status of 
psychological science. 


STANDARDIZATION 


By the standardization of a test is meant the establishment of 
norms for the interpretation of the results it yields. This is partly 
a matter of the organization of the test itself and partly of inter- 
preting the scores obtained when it is applied. These two con- 
siderations are interrelated because the scores which result when 
the test is given depend upon its organization. If it is an age- 
scale, like the Binet scale of 1908 (see Figure 4) and numerous 
other instruments following the same general pattern, the subtests 
are grouped at a series of age levels, and the test immediately 
yields a mental age score. This, of course, means that a process of 
standardization has already been accomplished. If it is set up like 
Army Group Intelligence Examination Alpha and many similar 
instruments, it yields an ordinary numerical score which must be 
further interpreted before it can be used for guidance or other 
purposes. 

To make the matter specific, consider the material presented in 
Tables 6 and 7. The data in Table 6 consist of a set of direct, or 
numerical, or “raw” (i.e., uninterpreted) scores made on Army 
Alpha by a group of college students. A direct inspection of these 
scores can reveal only one thing—the relative positions of these 
individuals within the group. 

But we want to know much more than this, and we must know 
more if the test and its results are to mean anything or to be of 
any significant service. Army Alpha purports to measure intelli- 
gence. How much intelligence, then, do these scores indicate? We 
note that 201 is the highest and 126 is the lowest. How much 
more intelligence has the person who makes the former than the 
person who makes the latter? What is indicated by the scores that 
lie between the highest and the lowest? Does the group as a whole 
rank high, or low, or about average in the scale of intelligence? 
A mere inspection of the scores cannot answer any of these ques- 
tions, but a reply of some kind is essential if they are to have 
either theoretical or practical value. 


TABLE 6 


SCORES OF 54 COLUMBIA COLLEGE MEN ON ARMY GROUP INTELLIGENCE 
EXAMINATION ALPHA 


(Quoted from Garrett, 1926, Table 1, p. 3) 


185 127 168 177 157 172 
201 160 184 137 164 176 
188 151 188 195 185 164 
195 185 179 182 158 [illegible] 
[illegible] 185 178 144 170 
174 183 126 154 189 
158 179 155 177 198 
197 188 169 165 188 
176 185 146 153 160 
138 155 151 [illegible] 157 
An illustration from another field may serve to make the issue 
clear. Suppose one is presented with a tabulation of 52 measure- 
ments of height, ranging from 46” to 70”. What do they mean? 
It all depends on what it is they measure. If they are a set of 
measurements of the standing height of 52 human adults we can 
interpret them readily. 46” is very short, and 70” is very tall. 
Comparatively few human adults are at the limits of such a range. 
The scores for height, unintelligible in and of themselves since 
they are no more than a set of numbers, become intelligible when 
projected against the whole range of the statures of adult human 
beings. 

This is exactly the principle employed in interpreting the raw 
scores yielded by a psychological test. One method of interpreting 
them is illustrated in Table 7. Here the scores of many thousands 
of individuals on Army Alpha are grouped into seven categories, 
designated by the familiar letter-grade symbols, with the per- 
centages of the total large group falling into each category indi- 
cated. The method by which this classification was arrived at is 
worth understanding. It was done by considerable cut-and-try and 
discussion among the psychologists who devised and administered 
the test. Their final decision turned on two considerations. First, 
the grade of E was defined as meaning unfit for regular service 
and was not retained. In fact, examiners were strictly prohibited 
from giving a rating of E on this test. Then, after a good deal of 
experimentation, the classification shown was decided upon for 
the reason that, in the opinion of the experts responsible, it rep- 
resented the most “satisfactory distribution” of raw scores. 

There is now no difficulty in making an evaluation or interpre- 
tation of the scores of the 54 Columbia students shown in Table 6. 
The score of 201 indicates an intelligence rating of A. The score 
of 126 indicates an intelligence rating of B. The whole group is 
definitely high in the scale of intelligence. Such seems to be the 
evident conclusion. 

But there is a step in the argument which has not yet been 
made explicit. Statements have been made about the rating of 
these 54 individuals with reference to intelligence as such, or with 
reference to human intelligence in general.* All that has been 
actually proved, however, is how they rate with reference to the 
test performance of the very large group of persons on whom the 
letter classifications were worked out. The underlying assumption 

* In strictness the reference is to American male human beings of draft age. 




groups. A score of 72 on the test is better than that of 99.4% of 
an unselected group of persons 18 years old, better than that of 
99% of high school seniors, and better than that of 98% of college 
students. It should be noted that these equivalents were developed 
with specific groups of given composition, and that they are by 
inference generalized to apply to all unselected eighteen-year-olds, 
all high school seniors, and all college students. 

Second, both tables show the interpretation of raw scores by 


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tions above the mean will be better than about 99% of the cases. 
With the distribution shown in Figure 295, it is better than any 
obtained score, for the best performance is only about 1.25 stand- 
ard deviations above the mean. 

Clearly, then, the use of standard score equivalents, like the 
use of percentile equivalents, is a method of interpreting raw scores 
by classifying them on the basis of the performance of the stand- 
ardization group. When a given raw score is equated to a certain 
standard score, what is really said is that a certain proportion of 
the standardization group exceeds or falls below that raw score. 
And once more the assumption is made that the standardization 
group is representative, so the standard score is taken to mean 
that a certain proportion of all persons to whom the test applies 
falls below the raw score of which it is the equivalent. 
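
The two kinds of equivalent discussed above may be illustrated by a minimal sketch. The group of scores is hypothetical and is not drawn from any actual test. The function names are likewise assumptions made for illustration. The last function holds only on the further assumption that the scores are normally distributed.

```python
from statistics import NormalDist, mean, pstdev


def standard_score(raw, group_scores):
    """Distance of a raw score from the group mean, in standard deviations."""
    return (raw - mean(group_scores)) / pstdev(group_scores)


def percentile_equivalent(raw, group_scores):
    """Percentage of the standardization group scoring below the raw score."""
    below = sum(1 for s in group_scores if s < raw)
    return 100.0 * below / len(group_scores)


def percent_below_if_normal(z):
    """Percentage falling below standard score z, ASSUMING a normal distribution."""
    return 100.0 * NormalDist().cdf(z)


# Hypothetical standardization group; these are not data from any real test.
group = [10, 20, 30, 40, 50]
print(standard_score(50, group))
print(percentile_equivalent(40, group))
print(percent_below_if_normal(2.33))
```

Both equivalents are computed wholly from the standardization group. If that group is not representative, the figures remain arithmetically correct but the generalization drawn from them fails. The percentile here counts only scores strictly below the raw score, and counting scores at or below it is an equally common convention.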

Table 8 illustrates yet another procedure for the interpretation 
of raw scores, i.e., their classification on a basis of mental age 
values. An Otis raw score of 24 is equivalent to a Stanford-Binet 
mental age of 12-2 (twelve years and two months). Now the two 
Stanford Revisions of the Binet Scale are both age scales. That is, 
the subtests of which they consist are assigned to certain age 
levels. The statement of equivalence in Table 8, then, means that 
a person who is able to make a score of 24 on the Otis test will be 
able to pass the Stanford-Binet tests up to the level of 12-2. This, 
of course, was ascertained by giving both the Otis test and the 
Stanford-Binet test to at least a part of the standardization group. 
And again the meaning of the statement is generalized by implica- 
tion on the assumption that the standardization group is repre- 
sentative. 


It would, of course, be perfect t 
age equivalents for the Otis scores directly, without any reference 
to the Binet scale. These would be either the mean scores for each 
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a true sample. And the interpretive classifications are generalized 
to apply to any persons who may take the test. 

At a later point in this book, and after dealing with the 
numerous and far-reaching questions involved, we shall return to 
a more detailed analysis of these various types of scores and to 
the whole problem of standardization. For the time being, how- 
ever, the important point is to understand its technique and its 
inferential character. The uninterpreted raw scores yielded b 


) t 1 indication of intelligence, or 
mechanical aptitude, or musical ability, or what not, the scores 


of the general perform- 
, there is no possibility of 
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valid only if its standardization group is a true sample. For in- 
stance, a revision of the Binet scale made and standardized in 
England, using English children as the standardization group, was 
applied to Kaffir children. It was found very misleading. Among 
the subtests are some that call for decisions about dates, about 
coins, and about colors. But the Kaffir children had no experience 
with either dates or coins, their language contains no word for 
yellow, and they have difficulty in discriminating green from blue 
(Martin). The question of a universal test of mentality, applicable 
to all human beings everywhere, has been mooted, but nothing 
approaching it has ever been achieved because of the enormous 
differences among different racial, national, and other groups of 
men (Schieffelin and Schwesinger). 

2. It is always a matter of doubt whether two tests which 
nominally deal with the same function—intelligence, mechanical 
ability, artistic ability—are strictly comparable. Partly, this is 
because their authors may use different statistical procedures in 
computing the interpretive norms. But the question chiefly arises 
because the two tests are constructed with reference to different 
standardization groups. 

3. The basic orienting concepts on which tests are built are 
never perfectly clear, explicit, and unambiguous. This, however, 
is not a criticism of psychometrics alone. It is due to the present 
status of psychological science. As a better understanding of the 
organization and operation of the human mind emerges, it will be 
possible to construct better tests. 

4. The work of translating the orienting concepts into instru- 
ments of measurement as valid, reliable, objective, and clearly 
interpreted as possible is subject to many limitations and much 
doubt. It is, however, essential if there are to be instruments of 
mental measurement at all; and rough and crude though it may 
be, it has proved successful in considerable measure. 

5. It is always dangerous to assume that a mental test can reveal 
and measure intelligence, or aptitude, or talent, or attitude, or 
personality type in general, or can uncover its universal essence. 
It can only reveal and deal with any such function or trait in the 
setting of a particular population, though perhaps a very large one. 
Presumably, intelligence has certain fundamental aspects that are 
identical in its manifestations among Whites, and Negroes, and 
Chinese, and Russians, and Australian Aborigines. So one may 
assume that there are at least elements of universality in any well- 
constructed test of a mental trait or characteristic or function. 
But to carry the processes of generalization too far is extremely 
risky, and the risk is greatly increased if one is not aware of what 
one is doing. 
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QUESTIONS FOR DISCUSSION 


1. Examine the manuals of several representative tests of various 
types listed in Bibliography II at the close of the book and report 
on the methods used in securing and ascertaining validity. 

2. How clearly and adequately do the authors of the above tests 
define their working concepts? How successfully do they seem to 
select test items to implement them? 

3. Suggest any steps that might be taken in the administration of 
a test to increase its reliability. 

4. Which of the causes of unreliability mentioned by Symonds 
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seem to you to be more or less under the control of the person 
administering any test? 

5. How is the validity of a test affected (a) by its reliability, (b) 
by its objectivity, (c) by its standardization? 

6. What distinction might be drawn between the objectivity of 
the items of a test and of the test itself? 

7. What would be the effect of using the 54 Columbia students 
whose scores are tabulated in Table 6 as the standardization group 
for Army Alpha? 

8. Suggest some measures that might be taken to secure a stand- 
ardization group for any test that would be really representative. 

9. Discuss and consider the implications of Kuhlmann’s claim that 
high reliability is not always desirable in a mental test. 

10. Report and discuss any statements you may have heard that 
seem to imply a belief that tests deal with mental functions “as 
such” irrespective of their manifestation in some specific group or 
population. 


CHAPTER II 


THE CONCEPT OF GENERAL INTELLIGENCE 
THE CONCEPT: ITS IMPORTANCE AND BACKGROUND 


The whole modern testing movement has centered about the rise 
of a working concept of general intelligence. Pragmatically it still 
remains the most successful and fruitful of all the operating 
hypotheses that have emerged, both in its effects in the way of 
test construction and in the results that have accrued when tests 
have been applied to the solution of human problems. By and 
large, tests of general intelligence are still the best we have. Psy- 
chometric instruments intended to measure other mental functions 
have been modeled upon them, both in form and in method of 
construction, and sometimes they have succeeded and sometimes 
they have failed. One of the most important psychometric develop- 
ments in recent years has been the emergence of the doctrine that 
mental organization depends upon an array of separable and 
definable factors rather than upon a unitary general intelligence. 
Just how far the theory of factors is a departure from earlier 
views it is not yet possible to determine decisively, for the ques- 
tion is still controversial, but probably it can be regarded as an 
extension, or elaboration, or redefinition rather than as a com- 
pletely opposing point of view. An increasing number of excellent 
tests are being constructed on the basis of factor theory, but 
whether they will prove decisively superior to the earlier tests of 
general intelligence, which are still in very wide use, it is too soon 
to say. 

The present chapter will be devoted 
tempts to define, describe, characterize, 
conception. In the following three chap 
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ground of its emergence and development. It has figured as an 
element in a type of psychological thinking that has become 
prominent during a period roughly coinciding with the present 
century. Earlier psychology was much influenced by the widely 
accepted hypothesis of mental faculties, which were thought of as 
more or less independent capacities, such as reasoning, memory, 
imagination, emotion, musicality, philoprogenitiveness, and the 
like. The tendency was to try to explain any kind of mental per- 
formance in terms of the isolated action of the appropriate faculty. 
The faculties were often referred to as “mental muscles,” and 
there was a current belief that they could be strengthened or 
“trained” by appropriate exercises. It was assumed that the ability 
or potentiality of any person was due to what would now be called 
the profile of his faculties. This view was the logical basis of the 
practice of phrenology, which, however, involved the further as- 
sumption that the faculties of the mind could be determined by 
the contour of the skull. Phrenology never had full status in the 
best psychological thinking, but it was accepted enthusiastically 
by many very able and serious men—for instance, Horace Mann. 

Neither faculty psychology nor phrenology directly influenced 
the development of psychological testing, which was carried to 
considerable lengths during the nineteenth century, but they 
affected it indirectly. Tests were constructed to measure very spe- 
cific and limited functions, many of them psychophysical, such as 
auditory acuity, attention span, voluntary attention as measured 
by the cancellation of certain letters in printed material among 
other means. Such tests might very readily be objective, reliable, 
and in a narrow sense valid, for they measured what they were 
intended to measure, but they lacked any broad significance or 
utility. It was found that they had, for instance, very little rela- 
tionship to success in school, or to any of the higher and more 
complex and important manifestations of mentality. It cannot be 
said that they were derived directly from faculty psychology, for 
they did not undertake to measure these alleged independent men- 
tal entities. But so long as this view prevailed, it was a major 
obstacle to the development of a different and better type of psy- 
chometric instrument. 

The central element in the change that has taken place is the 
insistence that the mind or personality always acts as a whole, and 
not in segments, and that investigation must always take account 
of its total action if any real understanding is to be achieved. Such 
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terms as memory, reasoning, imagination, feeling, and so forth 
are of course retained. But they are thought of as functions of the 
entire personality, ways in which the mind as a whole deals with 
situations, and not as separate subdivisions or unitary capacities. 
One consequence has been the abandonment of the doctrine of 
mental training, or formal discipline, which assumed the possi- 
bility of strengthening separate faculties by exercises in which the 
content and setting were matters of indifference. And long before 
this viewpoint received its characteristic present-day expression in 
the work of configurationalist, or holistic, or Gestalt psychologists, 
it led to the rise of a different type of mental tests. 

This evolution of psychological thought and practice, in so far 
as it affected mental measurement, has been associated with the 
name of Alfred Binet. As early as 1895 he became interested in the 
development of a type of test very different from the then current 
measuring devices which dealt with narrow and highly specific 
functions. He worked out tests for general memory, mental imag- 
ery, and comprehension. But as yet his guiding and operating 
concepts were very vague. He was still feeling his way, and these 
early tests were largely fruitless. 

The turning point in his career came when he was commissioned 
to investigate the capacities and possibilities of children in the 
Paris schools, particularly those of low mentality, and to find 
means for differentiating at an early age those who had educa- 
tional promise from those who lacked it. It is very notable and 
significant that the modern testing movement has its source in a 
practical problem. Binet’s first solution to the problem which was 
set for him was his earliest more or less complete set of tests, 
which are shown in schematic outline in Figure 2. 

The 1905 tests which Binet published and used contain many of 
the features of present-day scales. Some of them are holdovers 
from an earlier day, such as the comparison of lengths of lines and 
the comparison of weights (numbers 21 and 22). But the main 
emphasis is upon true intellectual tasks, instances being the un- 
wrapping of candy, the execution of commands, naming of com- 
mon objects, sentence completion, paper folding, defining abstract 
terms, and numerous others. These tests, as will be seen, were not 
formally arranged as an age scale, but Binet was already aware 
of the significance of age differences and their relatio 
mentality. He pointed out that children of the same a 
different mental ability will pass different numbers of 
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claimed that imbeciles could pass those from 7 to 15, and that the 
various higher levels of mentality would show differential success 
with the rest. And above all, in spite of the inconsistencies noted 
above, which were eliminated from his later work, it is clear that 
he had in mind a comprehensive survey of the individual mental- 
ity, which had little to do with psychophysical processes and 
which was in effect a break with the assumption of separate 
mental faculties. It is very noteworthy that this extraordinarily 
pregnant idea was put into operation almost casually, without any 
fanfare or doctrinal trimmings, as the obvious answer to a prac- 
tical challenge. 

The concept of general intelligence 
First, it stands for a way of appraising and dealing with people. 
A human being is not to be understood as a composite of special 
faculties. If he is to be appraised at all, if his promise and limita- 
tions are to be assessed, this should be done in terms of a general 
over-all evaluation of something perhaps hard to define, but fairly 
recognizable, i.e., his general intellectual capacity. 

Second, it stands for a way of thinking about human beings 
and human mental life. Mental life must be considered as a total 
interconnected organic unity, and not as a sum of independent 
parts. The full implications of this doctrine can emerge only little 
by little, and are far from being wholly understood as yet. But 
the emergence of the concept of general intelligence amounts to 
a change in psychological theory and a change in psychometric 
technology going hand in hand. 

Can it be said that general intelligence “exists,” whereas sepa- 
rate faculties do not? Whatever the metaphysics of the problem 
may be, such a claim is in no way necessary. The essential point 
is that it provides a better working concept in our endeavors to 
understand and deal with human nature. A concrete and very 
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1. Ability to follow a moving object with the eyes. 
2. Grasping a piece of wood brought in contact with the hand. 
3. Showing a piece of wood to see if child will grasp it. 
4. Choosing between piece of chocolate and piece of wood. 
5. Candy wrapped in paper presented to see if unwrapped. 
6. Execution of simple commands and imitation of simple gestures. 
7. Knowing names of parts of body and simple objects. 
8. Indicating things in pictures in answer to questions. 
9. Naming common objects in picture. 
10. Telling which of two lines is longer. 
11. Repetition of three digits. 
12. Telling which of two weights is heavier. 
13. Asking for objects not present, for things in picture by nonsense 
    word, comparison of three unequal lines, then three equal ones 
    (suggestibility). 
14. Definition of objects. 
15. Definition of sentences. 
16. Indicating differences between pairs of objects. 
17. Thirteen common objects shown in picture for thirty seconds, 
    after which child recalls as many as possible. 
18. Drawing designs from memory after ten seconds’ exposure. 
19. Repetition of digits. 
20. Indicating resemblance between pairs of objects. 
21. Comparison of lengths of lines. 
22. Comparison of weights. 
23. Memory for weights shown by remembering which of weights 
    placed in order is missing when one has been removed. 
24. Finding rhymes to given words. 
25. Completion of sentences. 
26. Making sentence including three given words. 
27. Comprehending questions graded from easy to hard. 
28. Reversing clock hands from memory. 
29. Cutting a triangular piece from paper folded twice, task being 
    to tell what it will look like unfolded. 
30. Defining abstract terms. 
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kindergarten, also with very happy results (Webb and Shotwell). 
Could these wise and successful adjustments have been made if 
there had been an attempt to estimate the patterns of faculties 
manifested by these two children? Almost certainly not. Would 
tests of visual acuity, auditory acuity, attention span, differential 
threshold for weight, and the like have furnished a proper basis 
for decision? Again the answer is no. But a general over-all survey 
of mentality provided what was necessary. Is it to be anticipated 
that some day there may be tests not based on general intelligence 
as now understood, which would tell the story better and reveal 
the facts more discriminatingly? This is entirely possible. Faculty 
psychology provided a poor set of working conceptual tools. Our 
present holistic psychology provides a much better set. Still better 
conceptual instruments will very probably be developed in the 
future, such indeed being precisely the hope and intention of our 
present analytic techniques. This is the only essential point either 
for scientific or practical purposes, and the “reality” of the con- 
cepts is a question which need not arise at all. 


DEFINITIONS OF GENERAL INTELLIGENCE 


Many writers on the subject of general intelligence have felt 
obligated to put forward compact and more or less formal defini- 
tions of it. This, no doubt, is a praiseworthy endeavor and an 
important one. Although the results seem somewhat confusing at 
first sight, valuable insights can nevertheless be gained from them. 
It is certainly not necessary here to attempt anything like a com- 
prehensive catalog of such definitions, but a reasonable number 
of samples is well worth considering. Here are a few such with 
their chronological arrangement indicated. 

“Intelligence is a general capacity of the individual consciously 
to adjust his thinking to new requirements” (Stern, 1914). “Any 
sort of attentive memorial or perceptive activity is at the same 
time an intelligent activity just in so far as it includes a new 
adjustment to new demands” (same author). “Intelligence means 
precisely the property of so recombining our behavior patterns as 
to act better in novel situations” (Wells, 1917). “Intelligence 
seems to be a biological mechanism by which the effects of a com- 
plex of stimuli are brought together and given a somewhat unified 
effect in behavior” (Peterson, “Intelligence and Its Measurement,” 
1921). To be intelligent a test subject “has to see the point of the 
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problem now set him, and to adapt what he has learned to this 
novel situation” (Woodworth, “Intelligence and Its Measure- 
ment,” 1921). “Intelligence is the ability to learn” (Buckingham, 
“Intelligence and Its Measurement,” 1921). “An individual pos- 
sesses intelligence in so far as he has learned or can learn to adjust 
himself to his environment” (Colvin, “Intelligence and Its Meas- 
urement,” 1921). “An individual is intelligent in proportion as he 
is able to carry on abstract thinking” (Terman, “Intelligence and 
Its Measurement,” 1921). “We may then define intellect in gen- 
eral as the power of good responses from the point of view of 
truth or fact” (Thorndike, “Intelligence and Its Measurement,” 
1921). Thurstone (1923) characterizes intelligence as a movement 
from trial and error towards increasingly abstract controls. “In- 
telligence may be regarded as the capacity for successful adjust- 
ment by means of those traits which we ordinarily call intellectual. 
These traits involve such capacities as quickness of learning, 
quickness of apprehension, the ability to solve new problems, the 
ability to perform tasks generally recognized as presenting intel- 
lectual difficulty because they involve ingenuity, originality, the 
grasp of complicated relationships, or the recognition of remote 
associations” (F. N. Freeman, 1925). “Intelligence is the ability 
to learn actions or to perform new actions that are functionally 
useful” (F. N. Freeman, 1940). 


From this brief survey of typical attempts to define general in- 
telligence several significant points arise. 

1. The prevailing impression is not one of contradiction but of 
vagueness. It is true that there are differences in emphasis, but 
on the whole it would not seem very difficult to reconcile these 
various formulations. Indeed one might not unjustifiably suppose 
that the authors represented would, on discussion, find themselves 
in substantial though no doubt not entire agreement. One feels 
that they are dealing with a conception that is essentially hazy 
although no doubt meaningful, and that cannot be pinned down 
within the scope of a compact statement. 

2. In order to bring out the common meanings and also the 
divergencies among these definitions, at least two important writ- 
ers have tried to assign them to a scheme of classification. Thus 
Pintner (1937, pp. 47-51) groups them as biological, educational, 
faculty, and empirical. Again F. N. Freeman (1939, 1940) classifies 
them as organic, i.e., those which characterize intelligence as a 
characteristic of organic constitution; social, i.e., those which em- 
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phasize its dependence upon symbols and cultural concepts; and 
behavioristic, i.e., those which define it in terms of performance 
on a given test. Unfortunately, neither scheme of classification 
seems very enlightening or very successful in bringing a tidy order 
out of a baffling array of formulations. The various groupings 
overlap, and it is often hard to say just where any given definition 
should be finally assigned. What Pintner and Freeman have ac- 
tually succeeded in doing is to play up the indubitable truth that 
general intelligence is a vague concept and that it may properly 
be considered from many different though by no means conflicting 
viewpoints. 

3. The chronological pattern of these definitions, which was 
emphasized in presenting them, is perhaps more revealing than 
any attempt at a topical classification. As will be seen, they ex- 
tend all the way from 1914 to 1940. The impressive thing that 
emerges from a chronological survey extending over twenty-six 
years is the repeated appearance of the same points, with only 
minor additions and shifts of emphasis. To refer to a specific case, 
Freeman’s formulation in 1940 is substantially the same as the 
one he offered fourteen years earlier, except that in the former he 
mentions functional usefulness. Presumably this would mean that 
intelligence manifests itself more adequately in administrative 
decisions or scientific research than in chess or crossword puzzles. 
But even this is not certain, because functional usefulness can take 
in a great deal of territory. 

4. There is only one type of definition of general intelligence 
to which express exception must be taken, and no instance of it 
occurs in those cited above. This is any definition or description 
which claims that intelligence is hereditary. Boynton (q.v.), for 
example, begins his fourfold description by stating that intelli- 
gence is “an hereditary capacity.” The objection to this is not 
that the proposition is false, which may or may not be the case, 
but that it prejudges issues which can be settled only by investi- 
gation, if they can be settled at all. To begin by defining general 
intelligence as hereditary is to beg a whole range of momentous, 
practical, and theoretical questions. 

The above will serve to give the reader a fairly adequate ac- 
count of attempts to reduce general intelligence to a definition. He 
may very well feel that so far as he is concerned, the upshot is 
that he still does not know exactly what it is. If he undertakes to 
make a more exhaustive study of the literature to which references 
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are made in the text and the bibliography, this impression is likely 
to be strengthened. It is perfectly correct. That, indeed, is the 
essential point. General intelligence is a loose and vague concept. 
It is not the same idea as Thorndike’s “altitude of intellect,” 
which means precisely the level of difficulty that can be attained 
in a graded scale of intellectual tasks. Nor is it the same idea as 
Spearman’s “general factor,” which, so far as testing is concerned, 
means precisely the eduction of relationships and the eduction 
of correlates. It was used by Binet as a loose, vague, but still 
significant guiding hypothesis, as a warrant for assessing human 
beings in terms of a comprehensive survey of their performance 
on what would ordinarily be considered intellectual tasks. And it 
has been so understood and used ever since. The ultimate defense 
of this concept is not its theoretical clarity but its workability. It 
shares the pervading vagueness of all holistic or configurationalist 
psychology. But perhaps nothing more than vagueness is possible 
in the present state of our psychological knowledge. Perhaps 
attempts at too great precision are premature, and end only in the 
production of triviality and falsehood. At any rate, the net testi- 
mony furnished by the testing movement to date is that the holis- 
tic point of view, indefinite though it may be in many respects, 
actually pays out in appreciable practical success. 


DESCRIPTIONS OF GENERAL INTELLIGENCE 


The attempt to compress the concept of general intelligence 
into a compact definition is no doubt attractive 
plex, too many-sided, too wide-ranging, 
cally and practically, decisively appear. 

Stoddard presents an elaborate descriptive analysis, massed 
under seven headings. “Intelligence is the ability 
activities that are characterized by (1) difficulty 

Here is a comprehensive statement that merits careful attention 
to all its details. Difficulty must not be thought of as an ability 
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to do unusual freak tasks, such as the feats of the lightning calcu- 
lator, or the defining of extraordinary words, or “information 
Please” memory stunts. It means the capacity to perform high- 
level intellectual tasks such as those of higher mathematics, highly 
organized artistic or literary production, strategy, administrative 
decisions, and the like. Complexity is not a mere matter of addi- 
tive range. It refers to the ability to hold together many considera- 
tions in a unitary effort, such as manifests itself in any high-level 
skill, or in complex research. Abstractness is the key characteristic 
of all high-level mental operations. It means freedom from the 
immediate, from trial-and-error processes, such as is gained by the 
free use of verbal and other symbols. It is well typified by the 
mathematical sequence to greater abstractness from arithmetic, 
through algebra, to calculus, and beyond. Thus the above three 
characteristics all have to do with mental organization, and can 
no doubt be quantified to some extent by existing psychometric 
techniques. 
Stoddard finds economy a better word than speed, for it means 
moving towards a goal or performing a task without irrelevancies. 
Adaptiveness has always been recognized as a characteristic of 
intelligent behavior. The emergence of originals in working meth- 
ods and results is highly characteristic and indicative of superior 
mentality but is hardly recognized in psychometric tests. Concen- 
tration of energy on a purpose is highly important. It must not, how- 
ever, be interpreted as a blind sticking to assigned tasks, but 
rather as self-direction and persistence in significant endeavor. 
Moreover, intelligence involves resistance to emotional blockages 
and distractions, such as those coming from popular shibboleths, 
advertising slogans and claims, prejudices that ignore reason, 
self-distrust, and so forth. The mention of the social significance 
of problems is noteworthy. It is characteristic of much modern 
thought on the problem and nature of intelligence, the suggestion 
being, as previously remarked, that intelligence shows itself 
better in the social planner than in the chess expert. Also the 
choice of meaningful tasks and goals is regarded as part of the 
total picture of general intelligence. 
Boynton offers a less elaborate and extensive characteri- 
zation. (a) Intelligence is hereditary according to the first point in 
his description. This, as we have seen, is open to fundamental sys- 
tematic objection. It ought not to appear in a general definition or 
description, quite apart from its truth or error. (b) It involves not 
only adaptation but reconstruction. Here again is the emphasis 
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upon original solutions and constructions, though differently 
phrased, for reconstruction clearly indicates the discovery of new 
solutions and the alteration of circumstances instead of the mere 
adjustment to them. Boynton, like Stoddard, remarks that this 
aspect of intelligence is not much recognized in tests. (c) The best 
indication or criterion of intelligence is the behavior of the indi- 
vidual in his group. Once more, from a somewhat different angle, 
we have the emphasis on social significance in the tasks or activi- 
ties in which intelligence manifests itself. A child may appear 
stupid in school or in test situations, but not so in a satisfying 
social setting. (d) A characteristic manifestation of intelligence is 
to look beyond the temporary and to envisage alternative group 
needs. This, to repeat, is definitely a less complete characterization 
than the foregoing. The emphasis on social significance and group 
action as related to intelligence is noteworthy. 
In sharp contrast with the descriptions just summarized is that 
of Thorndike (Thorndike and Others, 1927), which has already 
been briefly mentioned in these pages. For Thorndike, intelligence 
has four attributes—level, range, area, and speed. (a) Level (or 
altitude) refers to the difficulty of tasks that can be performed. As 
he puts it, if all the intellectual tasks in the universe were ar- 
ranged in a series of increasing difficulty, the level of any indi- 
vidual intellect would be determined by how far along the series 
it could go. Level is by far the most important single aspect of 
intelligence, and nothing can substitute for or offset it. It cannot, 
however, be measured entirely in isolation. (b) The range of in- 
telligence refers to the number of different tasks that can be 
achieved at any given level. Again, slightly to paraphrase
Thorndike’s formulation, he puts it as follows. If all possible intellectual
tasks in the universe were rated for difficulty, all those on equal 
levels of difficulty would constitute range. He considers that in 
theory a person should be able to do all conceivable tasks on his 
intellectual level, though he admits that practically this is not 
possible because of lack of opportunities to learn. For instance, 
this would mean that a chess expert should be able to solve
mathematical or scientific or strategic or administrative problems up to
the difficulty level of the chess situations he can handle. In
intelligence tests, according to Thorndike, range is represented by tasks
of different kinds but of equal difficulty, and range and level
correlate almost perfectly. This amounts to the claim that competence
and versatility go together almost perfectly. Inborn intelligence
is what determines level, but it cannot be measured without
introducing range, i.e., without introducing tasks of more than one
kind. (c) Area is simply the summation of all ranges. It is not, for 
Thorndike, very important for measurement. Clearly this is im- 
plied in his whole position, for if a person can do all kinds of tasks 
on his top level of difficulty, there is no point in giving him
numerous diversified tasks that he can do easily. He points out that a
privileged and an underprivileged child may both have the same
level, because they are identical in inborn ability, but they may 
differ in the area of tasks they can compass because one is helped 
by wide training and experience and the other is not. For him, the 
much mooted effect of training in a good nursery school would be 
to increase area but not level. (d) Speed is a significant aspect of 
intelligence, but less closely bound up with the essential attri- 
bute of altitude or level than are the other two. 

It is highly instructive to compare these three descriptions, for 
they manifest divergencies that are both startling and revealing. 

1. The clear meaning of Thorndike’s formulation is that the
kind of job on which intelligence operates, and in and through
which it is revealed, does not matter, except in so far as it is an
“intellectual” task. In particular, social significance,
meaningfulness to the subject, interest, and so forth do not matter, at least
in theory. So far as intelligence is concerned, the difficulty of the
task is by all means the primary consideration. This, of course,
explains the building of the I.E.R. Intelligence Scale CAVD, which
consists of four lengthy series of four kinds of tasks: completions,
arithmetic problems, vocabulary, and directions.

2. Difficulty is thought of by Thorndike in terms of isolated 
and isolable jobs that are ranked in order according to the per- 
centages of a standardization group by which they are passed. It 
is very different from Stoddard’s conception, which associates
difficulty with complexity and abstractness as one phase of a unitary
process of mental organization. Also, Thorndike’s assumption is
that when a certain order of difficulty is established it will be
the same for all persons, again at least in theory. Of
course the idea of isolating all conceivable intellectual tasks and
ranking them in order of difficulty almost reduces one of the
common techniques of test construction to an absurdity. Test
items are very frequently ranked in difficulty in terms of the
performance of a standardization group. But to equate this ranking
with a serial order of all possible tasks is a very formidable logical
leap.

3. Consistently with the assumption that any kind of task is 
equally indicative of intelligence so long as it has a stated level
of difficulty, Thorndike takes no cognizance of reconstructive
activities or the emergence of originals.

4. Speed apparently means for Thorndike the same thing with 
many easy tasks, or in the attack upon exacting, complex, disturb- 
ing and baffling problems. But surely there is an essential differ- 
ence. In the first case we have merely a process of rapid enumera- 
tion—the doing of one thing after another at a given rate. In the 
second situation we have choice among alternatives, recognition 
of choices that would be hopeless, vital exploration, vital decisions, 
an awareness of relevance. And here rapidity may be highly im- 
portant and highly indicative. In this latter situation, in other 
words, intelligence is manifested by what Stoddard calls “econ- 
omy” (see also Kuhlmann, 1939). 

5. Range and area are expressly characterized by Thorndike in 
terms of the number of tasks of different kinds that occur and 
can be done. Apparently painting with oils and with water colors
on the one hand, or painting with oils and fixing an automobile on
the other would be pairs of tasks that would have the same value 
and meaning with regard to range. 
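
Point 2 above turns on the common practice of ranking test items by the percentage of a standardization group that passes them. A minimal sketch of that procedure follows, in Python. The item names and pass records are invented for illustration and come from no actual scale.

```python
def rank_items_by_difficulty(responses):
    """Rank items from easiest to hardest by percent passing.

    `responses` maps a (hypothetical) item name to a list of booleans,
    one per member of the standardization group: True means passed.
    """
    percent_passing = {
        item: 100.0 * sum(passed) / len(passed)
        for item, passed in responses.items()
    }
    # Easiest item (highest percent passing) comes first.
    return sorted(percent_passing.items(), key=lambda pair: pair[1], reverse=True)


# Invented pass records for a five-person standardization group.
example_group = {
    "completion_1": [True, True, True, True, False],
    "arithmetic_1": [True, True, False, False, False],
    "vocabulary_1": [True, True, True, False, False],
}
ranking = rank_items_by_difficulty(example_group)
```

The ordering obtained holds only for the items and the group actually sampled. Nothing in the computation extends it to all possible tasks or to all persons, which is the force of the objection.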

Thorndike’s description of intelligence is one of the best in- 
stances of what critics have in mind in saying that testing is based 
on a mechanistic and atomistic psychology that ignores the variety, 
subtlety, and organic unity of the human mind. But it is by no 
means the sole or necessary logical foundation of all psychometric
instruments which purport to deal with general intelligence. It is
an explicit general formulation of the ideas about which a certain
battery of tests was built, the I.E.R. Intelligence Scale CAVD. But
McNemar (1942), for instance, expressly denies a cardinal point 
in Thorndike’s doctrine, namely the very high correlation between 
level and range. Put in other words this means that the kind of 
tasks set up is important to a large extent independently of their 
difficulty, so that a person’s mentality will be revealed better in 
one kind of undertaking than another. This is why many scales,
including that of Binet and its major revisions, actually contain a
wide variety of subtests. Again Stoddard and Boynton set up a
concept of intelligence broader by far than that of Thorndike. 
They emphasize much in intelligence that does not appear at least 
directly in any tests. The reason why these factors are not em- 
bodied in our instruments of measurement is not that they are 
lacking in significance, or importance, or authenticity, but because 
the means of translating them into valid, reliable, objective and
intelligibly standardized tests are not available. Nevertheless our 
tests, although limited, really do contain much of the essence of 
intelligence, as their successful application sufficiently shows. The 
reason why the I.E.R. Intelligence Scale CAVD is a serviceable
instrument, which in fact it is, is not because range correlates almost
perfectly with altitude so that in theory only difficulty matters, 
but because it contains an array of tasks within the “range” of 
many and indeed most of the human beings for whom the test is 
intended, and because these tasks really do embody many impor- 
tant characteristics of general intelligence. 


EMPIRICAL CLARIFICATION OF CERTAIN ISSUES 


Some of the basic issues in connection with the conception of 
general intelligence have received a measure of empirical clarifica- 
tion by experimental and statistical analysis. The results are 
worth noting.


1. Intelligence and learning capacity 


As will be seen from the definitions quoted above, or by refer- 
ring to the literature mentioned, general intelligence has quite 
often been considered as identical with the capacity to learn. What 
experimental evidence there is suggests quite strongly that any 
such statements must be severely qualified.

Johnson (q.v.) gave 60 college students 10 minutes practice a
day on mirror reading. He obtained correlations between
performance on this task and scores on a number of standard intelligence
tests. Mean intelligence score correlated .34 ± .08 with the average
number of words read per diem, and .46 ± .07 with improvement
over 20 days. He found in addition that those in the upper half
of intelligence improved in mirror reading more than those in
the lower half. Joseph Peterson (1922) found a moderate
correlation between his rational learning test, shown in Figure 3, and
scores on intelligence tests.* Jordan (q.v.) administered four
group intelligence tests and also the Stanford Revision of the

* In the rational learning test, or “mental maze,” the subject does not look at
the diagram. The examiner calls off the letters two at a time, and the subject
chooses, his purpose being to reach the goal at H. Whenever an error occurs, as
in choosing L instead of I, A instead of O, and so forth, he goes back to the
beginning and starts again with the pair N V. The test involves a rather unique
combination of rote learning and comprehension of a general plan or layout.
Binet Scale to 64 high school pupils, and also an “ideational
learning test.” This latter test presented a series of pairs of letters at
various distances from each other in the alphabet, e.g., M O, B W,
etc. The letters were indicated by numbers corresponding to their 
serial position in the alphabet. (In the above two instances, the 
pairs would be indicated as 13 15, and 2 23). The task was to learn 
to write the letter midway between the two in each pair. Twelve 
practice periods of 3 minutes each were given during an hour. The 
correlations between intelligence test scores and this rather complex


FIG. 3. RATIONAL LEARNING TEST (MENTAL MAZE)
(Peterson, J., 1922)


piece of symbolic learning were “uniformly low” for the
various groups used. They centered around .20, and ranged from
.31 to —.I125. Smith (q.v.), working with a group of 95 subjects
whose average mental age was 9 years, gave them practice on
spatial and perceptual tests, and found the gains made were
unrelated to mental age. Grace McGeoch (q.v.) again has presented
some evidence that brighter individuals tend to benefit more than
the duller by using the whole method of learning. She worked with
two groups of children, 30 in each, from 9 to 10 years old. The
mean intelligence quotients of the two groups were 99 and 151.
Both groups were set to learn series of Turkish-English
vocabulary pairs, and also ten-line poems. Three methods of learning
were used: the part method, the progressive part method, and the
whole method. The whole method was found superior for both
groups, but markedly so for the brighter. On the vocabulary
learning the progressive part method was superior to the pure part
method for the brighter subjects. McGeoch accounts for this by
the fact that the brighter subjects had a better grasp of pattern
and better mental organization on the job. Once again, J. A.
McGeoch (q.v.), working with children from 9 to 14 years of age, has
found only a slight relationship between intelligence and the
ability to make a correct report of objects and events that have been
observed.
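
The figures attached to Johnson's two correlations (.08 and .07) are presumably probable errors, the usual convention of the period, though the text does not say so. On that assumption, and using the conventional formula PE = 0.6745(1 - r^2)/sqrt(N), a short Python check reproduces both values from N = 60. The function name is illustrative only.

```python
import math


def probable_error_of_r(r, n):
    """Conventional probable error of a correlation coefficient.

    PE = 0.6745 * (1 - r**2) / sqrt(n). Treating the reported
    figures as probable errors is an assumption, not something
    the source states.
    """
    return 0.6745 * (1.0 - r ** 2) / math.sqrt(n)


pe_words = probable_error_of_r(0.34, 60)        # words read per day
pe_improvement = probable_error_of_r(0.46, 60)  # improvement over 20 days
print(round(pe_words, 2), round(pe_improvement, 2))  # prints 0.08 0.07
```

Both correlations are several times their probable errors, so on the standards of the time they would be treated as dependable, though modest in size.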

This work is evidently none too conclusive. The learning tasks 
are, for the most part, of a very limited and artificial type. Smith, 
for instance, gives what she calls an “operational” definition of 
learning as gain due to practice on a test. Such notions hardly 
correspond to the broad conceptions of “adjustment” and “power 
to acquire new skill and insight,” which those who equate intelli- 
gence with the power to learn presumably have in mind. More- 
over, it seems likely that except in one and perhaps two of the 
experiments, the total range of intelligence among the subjects was 
rather narrow, since college and high school students were largely 
used, the notable exception being the investigation by Smith. This 
would tend to attenuate the obtained correlations, which
therefore might very well underestimate the true relationship. But it is
clear that there are serious difficulties in the way of an
out-and-out definition of general intelligence as meaning the power to
learn.
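
The attenuation mentioned above can be given rough numerical form. A standard correction for direct restriction of range, which none of the studies discussed applied, estimates the correlation in an unrestricted group from the correlation observed in the restricted one and the ratio of the two standard deviations. The Python sketch below applies it to Johnson's .34. The ratio of 1.5 is purely hypothetical, since no such ratio is reported for any of these studies.

```python
import math


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Estimate the unrestricted correlation under direct range restriction.

    sd_ratio = (SD in the full population) / (SD in the restricted sample),
    taken on the variable on which selection occurred (here, intelligence).
    """
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1.0 + r_restricted ** 2 * (u ** 2 - 1.0))


observed = 0.34            # Johnson's correlation with words read per day
hypothetical_ratio = 1.5   # assumed for illustration only
estimated = correct_for_range_restriction(observed, hypothetical_ratio)
print(round(estimated, 2))  # prints 0.48
```

Even a generous correction of this kind leaves the relationship well short of identity, which agrees with the conclusion drawn in the text.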

Woodrow (q.v.), in a broad survey of the evidence, makes the
following points. (a) There is little relationship between practice
gains in dealing with spot patterns, rearranging letters,
cancellations, making tallies, etc. (b) As far as school learning is
concerned, there seems to be no very decisive relationship between
learning in the separate subjects considered one by one and gen- 
eral intelligence. (c) Little or no relationship has been established 
between speed of learning and general intelligence, though the 
evidence here is quite limited. (d) There may very well be a rela- 
tionship between certain mental factors and certain types of 
learning. 

Some years ago Pyle (1925, 1928) on the basis of a correlational 
study of 50 kinds of learning, endeavored to show that a general
learning capacity exists, which he considered as equivalent to 
attentiveness. Later results, however, as Woodrow points out, 
throw much doubt upon the concept of general learning capacity. 
Clearly, we cannot define intelligence as equivalent to anything of 
the kind. What is indicated is a more analytic approach to the 
problem, for on a priori grounds it would seem almost incredible
that there could be no close relationship of any kind between
mentality and learning.


2. Intelligence and personality type


Over the period of the past twenty-five years results indicating
a relationship between intelligence and type of personality have
from time to time been published. Although they have a crucial 
bearing upon the concept of general intelligence itself, and also 
upon the problem of translating it into suitable instruments of 
measurement, they have received little notice in discussions of 
these topics until very recently. 

As long ago as 1920 Wells and Kelley (q.v.) demonstrated that
the various subtests of the Stanford Revision of the Binet Scale 
elicit different responses in normal and in psychotic persons. For 
both groups the vocabulary test has a high stability, and also the 
subtests which call for immediate memory for digits. But there is 
a marked difference in the tests at the ten-year level calling for 
the drawing of designs from immediate memory and for reporting 
on the thought content of a paragraph read by the subject, and 
also on the Ball and Field Test which is placed at the twelve-year 
level. On these tests Wells and Kelley found that functional psy- 
chotics tend strongly to perform less well than normal subjects of 
comparable mental ages.* Raymond Cattell (1945 b), again, in 
his elaborate factorial studies of personality, finds intelligence 
to figure as a general factor among traits, associated particularly 
with character traits, and more specifically still, with good habits. 

A report by Piotrowski (q.v.) expands the topic. He finds that
most mental disorders have a selective effect upon Stanford-Binet
subtest responses. Many of the subtests are more difficult for
psychotics than for normals, and the effect differs with different
psychotic categories, i.e., schizophrenics, depressives, paranoiacs,
and so forth. The differences between these types of personality
appear as differences in the “profiles” they make on the scale. This
is a term that requires some explanation. The Stanford-Binet, like
all the direct revisions of the Binet scale, is an age scale. That is,
its subtests are grouped at stated age levels. In measuring a sub- 
ject, the examiner starts him at a point where he should be able to 
succeed with all subtests, and continues on to the level of failure. 
Before the upper limit is reached he usually fails in certain of the 
tests while still succeeding with others. It is this pattern of successes


* The scale referred to is the first of two revisions made at Stanford University
under L. M. Terman. The second revision is entitled the Revised Stanford-Binet
Scale. The latter appeared in 1937. It is described on pp. 97-118 below and a
partial synopsis appears in Figure 5. For the Stanford Revision subtests
mentioned, see Terman (1916), where they are described in full. The subtest called
Ball and Field in the Stanford Revision is renamed Plan of Search in the Revised
Stanford-Binet.

and failures that is referred to as the “profile.” A psychotic
person will have a special tendency to fail in certain subtests, and 
these vary in accordance with the type of psychosis. Thus there 
may appear a characteristic “psychotic profile,” and such may be 
found in subjects who do not manifest a true functional psychosis. 
When this happens, it is clear that there are two consequences. 
First, the scale will underestimate the intelligence of the subject 
as registered in his attained age. Second, it will be possible to 
demonstrate a qualitative difference in his response to situations 
demanding intelligent responses. 

Rapaport, Gill, and Schaefer (q.v.) have recently published very
extensive data on the same subject. Instead of one of the Stanford 
Revisions, they used the Wechsler-Bellevue Intelligence Scale.* 
This is not an age scale but a point scale, made up of 11 subtests 
each much more extensive than the Stanford-Binet subtests, and it 
can be used for ages up to 60 years. The authors report that 
the “scatter” of a subject's showing, i.e., the pattern or configura- 
tion of his weighted scores on the separate subtests, is related to 
his personality type. The scatter pattern differs as between nor- 
mals and psychotics and as between different types of psychotics. 
Psychotics, that is to say, find certain of the subtests particularly 
difficult, and so make a poorer showing on them than normals of 
presumably comparable mentality. Thus schizophrenics (unclassi- 
fied) have no special difficulty with the information subtest, or 
with that involving digit span. Arithmetic, however, is much
impaired. There is little impairment in similarities. Picture
arrangement and picture completion are greatly affected, block design
only slightly, object assembly is impaired to an extreme degree,
and digit-symbol somewhat. In general, the authors report such
subjects definitely impaired in language tests calling for
comprehension, and markedly impaired in performance tests involving
visual organization. Similar findings are reported for the major
categories of paranoiacs, preschizophrenics, depressives, and
neurotics, and their subclasses. The authors also find that many
normal individuals yield scatter patterns resembling those of the
psychotic types to which they correspond.
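
The notion of “scatter” described above can be sketched as each subtest's deviation from the subject's own mean weighted score. This is one simple way of expressing a profile and not necessarily the procedure the authors used. The scores below are invented solely to show how two subjects with the same total can differ in pattern.

```python
from statistics import mean


def scatter_profile(weighted_scores):
    """Return each subtest's deviation from the subject's own mean score."""
    center = mean(weighted_scores.values())
    return {name: score - center for name, score in weighted_scores.items()}


# Invented weighted scores on four of the subtests named in the text.
subject_a = {"information": 10, "digit_span": 10, "arithmetic": 10, "object_assembly": 10}
subject_b = {"information": 13, "digit_span": 12, "arithmetic": 8, "object_assembly": 7}

profile_a = scatter_profile(subject_a)
profile_b = scatter_profile(subject_b)
same_total = sum(subject_a.values()) == sum(subject_b.values())
```

Here `same_total` is true, yet the first profile is flat while the second falls away on arithmetic and object assembly. A total score alone would not distinguish the two.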

Much of this work is controversial, and the details of such 
findings are subject to revision and correction. But the point here 
is that it forbids us to regard intelligence as one uniform function, 


* This instrument is discussed on pp. 129-136 below, and a synoptic outline is
presented in Figure 10.

identical for all persons and situations. It is by no means a mere
matter of altitude, or of graded difficulty of the tasks an
individual can accomplish. Its manifestations have within themselves
qualitative differences, and these are associated with the emotional 
and personal make-up of the individual. Thus Thorndike’s as- 
sumption that all tasks of a given range, i.e., of a given level of 
difficulty, are really equivalent, and that the only reason why a 
given person will not perform all such tasks equally well is the 
“accident” of his training, experience, and opportunity cannot 
be maintained. 

In the second place, these findings have a practical bearing 
upon test construction. It will always be preferable to have an 
instrument that can reveal these qualitative differences, which 
may be considered true differences in intelligence, as clearly as 
possible. Such an instrument will block out different types of tasks 
clearly and unmistakably, and embody a wide range of such types. 
The two Stanford Revisions fulfill the latter requirement very 
well. They contain a large number of quite varying subtests. But 
they do not fulfill the former as well as the Wechsler-Bellevue 
scale, in which eleven major subtests are blocked out. This, no 
doubt, is why Balinsky and Wechsler (q.v.) were able to show
that the Wechsler-Bellevue scale is superior as a clinical and
diagnostic instrument to the Stanford-Binet scale. These two
criteria would also mean that the clinical value of the I.E.R.
Intelligence Scale CAVD is not great, for it contains four types 
of verbal tasks which seem closely related on inspection, and 
which have been demonstrated as such by Thorndike and his 
associates. 

To sum the matter up, an intelligence test may yield a “global” 
or over-all total score, such as a mental age, or a percentile, or a 
standard score, and this may be considered representative of the 
subject's level of intelligence. But such a score does not tell the
whole story. There are also qualitative differences in intelligence;
and if a test covers them up, it is to that extent defective and
insensitive to the objective facts.


3. Estimating intelligence 


Considerable light on the conception of general intelligence 
comes from investigations of the problem of estimating intelligence
without the use of tests. It has been shown again and again that
unchecked estimates by teachers and others are extremely
untrustworthy. Several different estimates of the same person do not agree
well, and estimates of groups do not agree with their showing on 
standard tests. Pintner has shown that when teachers try to esti- 
mate the intelligence of children in school, they are very much 
influenced by school performance, and that their ratings are closely 
related to the grades of these children (Pintner, 1931, pp. 294-95).
Magson (q.v.) reports an attempt to rate the intelligence of 876
persons on a 7 point scale, for the purpose of assigning them to
free places in the British secondary schools. Ratings were made
on the basis of an interview by judges who were not acquainted
with the candidates, and their intercorrelations were about .15,
which means virtually no agreement or reliability. Once again,
Webb (q.v.) reports a study in which 104 students were rated for
intelligence by teachers and fellow students. Most of the ratings 
had a low relationship to results on an intelligence test, and the 
lowest relationship, which was virtually zero, was found for the 
ratings made by men on women. This finding is quite amusing, but 
it is also significant. The difficulty in undirected attempts to rate 
intelligence is the absence of criteria, and for this reason any 
irrelevant influence may have a completely disturbing effect. 
Ruch (1920) and also Varner (1922, 1923) have experimentally 
analyzed the causes of this difficulty, and have explored the possi- 
bility of overcoming them. They find first that a clear working 
conception of intelligence must be set up, so that other traits will
be disregarded as far as possible, such a clear conception usually 
being absent. Second, it is necessary, when dealing with children, 
always to consider the age of the person or persons being rated. 
Otherwise there is no way of forming an idea of what may prop- 
erly be expected, and size and grade placement may easily disturb 
the estimate and very often do. Third, it has been shown that 
older children are easier to rate than younger ones, and that dull
children are easier to rate than bright ones. Dullness shows up
far more distinctively and unavoidably than brightness, which
may easily be obscured by shyness. And also the unusual character
of a child’s reactions may not be realized because his age is not
taken into consideration. Fourth, it is found that rating within 
a stated grade is much more accurate than are attempts to rate 
children in general. This, of course, is because the grade group
constitutes a frame of reference on the general principle of a
standardization group. Fifth, it tends to improve ratings and estimates
if a reasonable distribution of ratings is imposed on those who do
the work. Those who make such estimates usually tend to rate
very high. Two groups of teachers were instructed to use the 
normal probability curve as a guide in making their estimates, 
each group using a five point scale. One group reported 63 children 
as A, 93 as B, 188 as C, 61 as D, and 6 as E in intelligence. The 
other group reported 28 as A, 54 as B, 36 as C, 12 as D, and 1 as 
E. So even when advice to use the normal distribution is given, 
there is a strong impulse to disregard it. And when no such ad- 
vice is offered, the extreme skewness of the results invalidates 
them. 
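
The two sets of teacher ratings just quoted can be set against a normal curve with a few lines of Python. The expected shares below assume category boundaries at 0.5 and 1.5 standard deviations on either side of the mean, which is one common way of cutting a five-point scale. The text does not say what division, if any, the teachers were given.

```python
from statistics import NormalDist

GRADES = ["A", "B", "C", "D", "E"]
group_one = [63, 93, 188, 61, 6]   # counts reported in the text
group_two = [28, 54, 36, 12, 1]    # counts reported in the text


def shares(counts):
    """Convert raw counts to proportions of the group."""
    total = sum(counts)
    return [count / total for count in counts]


def normal_shares(cuts=(-1.5, -0.5, 0.5, 1.5)):
    """Expected proportions A..E under a unit normal (assumed cut points)."""
    cdf = [0.0] + [NormalDist().cdf(c) for c in cuts] + [1.0]
    low_to_high = [cdf[i + 1] - cdf[i] for i in range(len(cdf) - 1)]
    return list(reversed(low_to_high))  # A is the highest category


for grade, s1, s2, expected in zip(GRADES, shares(group_one), shares(group_two), normal_shares()):
    print(f"{grade}: {s1:.1%}  {s2:.1%}  normal {expected:.1%}")
```

On these figures both groups place more children in A and B than in D and E, the second group markedly so, which is the tendency toward high ratings that the text describes.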

The significant outcome of these investigations is not merely 
that it is possible, even without the use of tests, to make a fairly 
accurate and valid estimate of a person’s general intelligence. The 
more important consideration is how such an estimate must be 
made. It must be a properly oriented survey that confines itself, 
as far as may be, to those aspects of a person’s behavior which 
reveal intellectual capacity and which involve intellectual tasks. 
Conclusions regarding a person’s intelligence must be drawn in 
the difference in this respect between him and others, with such 
disturbing factors as his age or the personal impression he makes 
properly discounted. And he must be rated with reference to some 
feasible and reasonable notion of the distribution of general
intelligence, this usually involving the assumption that such a
distribution will be approximately normal.


CONCLUSION 


Taking into consideration all the varied lines of thought and
investigation which have been summarized in this chapter, it is
clear enough what the concept of general intelligence as it has
arisen in modern psychology and is employed in modern testing
means. It does not stand for some monolithic, unitary, sharply
defined mental entity. It cannot be equated to “altitude” of
intellect alone, although the ability to succeed with increasingly
difficult tasks is one of its best indications. Nor is it by any means
equivalent to learning capacity, at least in any very precise or
specific sense. When we speak of a person’s general intelligence
we mean a congeries of mental functions. These it is possible to
survey and appraise, and the result can be expressed in an over-all
or global score or rating. But such a score, though meaningful and
useful, can never tell the whole story, because there will be
qualitative differences that it will not reveal. That is, two persons with
the same over-all rating may and probably will show different
patterns of strength and weakness in connection with tasks of
different kinds. These differences are in part no doubt due to training,
experience, and opportunity, but by no means wholly so. And they
also must be considered true differences in intelligence. General
intelligence, moreover, has many aspects or components which 
cannot be tested, not because they are not important and also not 
because they are intrinsically inaccessible to any kind of measure- 
ment, but simply because the technical means do not exist. Yet 
very important aspects of general intelligence can be measured 
reasonably well, and presumably as tests improve more and more 
aspects of it will become accessible. Thus the latest tests, such as 
the Wechsler-Bellevue scale, or the Kuhlmann Tests of Mental 
Development, seem to uncover more than the earlier and less 
evolved instruments, such as the direct revisions of the Binet scale. 
So it does not seem at all unreasonable to hope for still more 
progress, the general direction of which seems fairly manifest— 
i.e., towards tests which will differentiate better and reveal
qualitative differences more clearly and certainly.

“Intelligence,” as Sherman puts it (p. 8), “obviously is not a
single mental process, but a practical concept connoting a group
of complex mental processes.” This is how it has been conceived
from the time of Binet to the present day. Its significance in theory
lies in the reaction it involves away from a faculty psychology
and towards a holistic or configurationalist psychology. Its
significance in practice lies in the emphasis it implies upon broad and
comprehensive survey-like ratings of mentality, instead of the
testing of very narrow specific functions. In spite of its vagueness
it has proved exceedingly fruitful.

However, in the light of this account, the significance of the pres- 
ent-day movement towards a more analytic treatment of mental 
organization should be very evident. The attempt is being made 
to get away from the admitted vagueness of the notion of general 
intelligence, to define and isolate definite mental factors, such as 
inductive thinking, grasp of spatial relationships, numerical
thinking, and the like, and to build tests which will reveal them.
There is a trend towards tests which will yield profile scores on 
mental factors in place of the “global” score on general intelli- 
gence. The reasons for this development are abundantly clear, but 
its outcomes are not yet fully established. 
QUESTIONS FOR DISCUSSION


1. Does the fact that general intelligence is loosely defined invali- 
date it as a concept? If so, consider some other psychological concepts 
that would be invalidated. 

2. Does the fact that the concept of general intelligence arose in a 
practical setting and has been put successfully to practical uses seem 
to validate it theoretically? 

3. To what extent does it seem possible to reconcile the definitions 
of general intelligence that have been quoted? To what extent do they 
conflict? 

4. What inferences in regard to general intelligence might be drawn 
from the varying treatments indicated in this chapter, and also those
in “Intelligence and its measurement: a symposium”
(q.v.)?


5. Consider some of the practical meanings for the guidance of a 
human being of Thorndike’s claim that range correlates perfectly
with altitude. 

6. Which of the aspects of general intelligence mentioned by Stod- 
dard seem to lend themselves peculiarly to measurement? Which do 
not? In considering this question, refer to the discussion in Chapter 1 
of this book on the limitations of tests. 

7. Boynton and Stoddard both insist on the importance of socially 
significant tasks in revealing intelligence. Could this emphasis be 
translated into instruments of measurement? 

8. In what respects is the description of general intelligence offered 
by Thorndike “atomistic”? 

9. Do the investigations on the relationship between intelligence 
and learning invalidate any of the definitions cited in this chapter or 
found elsewhere? 

10. Cite and discuss instances from your own experience of qualita- 
tive differences, i.e., differences in the kind of tasks that can be done, 
as between persons more or less on the same level of intelligence.

11. Could artistic creation be considered as a type of intelligent 
activity?

12. Applying so far as you can the safeguards and conditions men- 
tioned in the chapter, make estimates of the intelligence of some 
persons known to you, and if possible compare those ratings with
their tested intelligence.


CHAPTER IV 


SCALES FOR THE MEASUREMENT OF 
INTELLIGENCE 


THE MEANING AND IMPORTANCE OF INTELLIGENCE SCALES 


This chapter deals with the most important scales for measuring 
intelligence. The word “scale,” although widely used in mental 
measurement, is not at all precise. Usually an intelligence scale 
can be applied to a wide range of ages, perhaps from 10 to 60 
years. But there are exceptions, as for instance the California
First Year Scale, intended for the first year of life. It is sometimes
assumed, without being explicitly claimed, that an intelligence
scale is for individual administration, and in fact all the scales
presented in this chapter are for such use, with one exception, and
even that one is partly individual. Yet the term is often applied 
to batteries of group tests. So the concept is not at all clear-cut, 


cation, and its radical revision, and culminates in the appearance of definitely
new orientations.


THE WORK OF BINET


As was pointed out in the previous chapter, the work of Binet 
in the field of psychometrics stemmed in general from a lifelong 
interest in the problem and specifically from the assignment to him 
of a major practical task. His first attempt to translate into an 
instrument of measurement the idea of general intelligence that 
had been growing in his mind for at least ten years is shown in
Figure 2, which outlines his earliest syllabus of mental tests. This
appeared in 1905, but it was not until three years later that all the
major characteristics of his method appeared. These were embodied
in the 1908 scale, partly summarized in Figure 4. Simply
put, it consists of a conglomerate of tests classified into age
levels. Such was the first systematic translation of the conception
of general intelligence into the items and layout of a comprehensive
instrument of measurement.

Before his death, Binet made one further revision of his tests,
which appeared in 1911. Various subtests were added. Some were
eliminated. Others were shifted to new age classifications. Although
Binet died with his work incomplete, the ideas and methods
he elaborated and the resulting scales were adopted and revised
in many lands. The account that follows will be confined to
the American work, which has been very fruitful and thorough.
To repeat, it forms a coherent story. The expansion, revision, and
partial supersession of the ideas and practices originating with
Binet do not follow a strict chronological order, to be sure. Some
very far-reaching departures were at least suggested and to some
extent put into effect very soon. But the whole great body of work,
extending over more than thirty years and enlisting the efforts of
some of the ablest men in the field, is a logical development which
constitutes, for better or worse, the very core of modern psychometrics.


EXTENSION OF BINET’S WORK: THE STANFORD REVISIONS


1. Stanford Revision of the Binet Scale for the Measurement
of Intelligence *
2. Revised Stanford-Binet Tests of Intelligence †


These two revisions, published in 1916 and 1937, were not the
first to appear in this country, but they are major landmarks in
mental testing. The work had ample financial backing and is an
outstanding model of test construction. It is definitely an extension
of Binet’s ideas, rather than a departure from them. Our
concern will be with the second revision, and reference to the first
will be chiefly for the sake of background.


* References: Terman, 1916; Terman et al., 1917.
† References: Terman and Merrill, 1937 a, 1937 b; McNemar, 1942.
AGE III

1. Pointing to nose, eyes, and mouth.
2. Repetition of short sentences.
3. Repetition of two digits.
4. Enumeration of objects in pictures.
5. Knows last name.


AGE V

1. Compares two boxes of different weights.
2. Copies square.
3. Rectangular card cut diagonally to be reconstructed according
   to a similar uncut card.
4. Counts four coins.
5. Repeats ten-syllable sentence.


AGE VII

1. Tells what is missing from unfinished pictures.
2. Knows number of fingers on one and both hands without
   counting.
3. Copy of written model.
4. Copies diamond.
5. Repetition of five digits.
6. Description of pictures.
7. Counts thirteen coins.
8. Knows names of four common coins.


AGE X

1. Repeats months of the year.
2. Knows names of nine pieces of money.
3. Uses three given words in one sentence.
4. Comprehension of easy questions.
5. Comprehension of difficult questions.


AGE XIII

1. Paper folding and cutting (as in 1905 tests).
2. Rearranges two triangles in imagination and draws result.
3. Differences between pairs of abstract terms.


FIG. 4. EXCERPTS FROM FIRST BINET SCALE (1908)
(After Pintner, 1931, pp. 137-40)
1. Characteristics


A. General characteristics. Both the Stanford Revisions are in- 
struments for individual administration, not group scales or tests. 
The first revision contained 90 subtests organized into a single
form. The second revision contains 129 subtests in each of its two
forms equivalent in difficulty, designated as Form L and Form M.
Most of the subtests from the first revision are retained in the
second, although there have been some changes and eliminations.
For example, there were items in the old Absurdities Test, located
at age 10, which were found to be emotionally disturbing (e.g.,
“Yesterday the police found the body of a girl cut into eighteen
pieces. It is believed she killed herself.”), and these have been
changed. The new scale covers a wider range of ages than the
earlier revision. The old scale ran from age 3 to 14, and then
through “Average Adult” and “Superior Adult.” The new scale
runs from age 2 through 14, and then through four adult levels;
namely, Average Adult, Superior Adult I, Superior Adult II,
Superior Adult III. The old scale grouped its subtests in 1-year
steps from 3 through 10, then presented groupings for ages 12 and 
14, and then the two adult classifications. The new scale has group- 
ings at 6-month intervals from ages 2 through 5, groupings at 
1-year intervals from ages 6 through 14, and then the four adult 
levels. Thus it offers an important extension both of content and 
age application. 

B. Items and subtests. For both the earlier and later revisions
very thorough canvasses for good test items were made. All items
already in use in intelligence tests were collated and studied and
additions were suggested. The items were critically scrutinized,
and those that passed the primary selection were tried out on a
standardization group of 3184 persons. In this work tests were
given in 17 communities located in 11 states, representing the East,
the South, the Midwest, and the West. Rural, urban, and occupational
groups of wide variety were represented in the standardization.
Urban representation was somewhat disproportionately high,
and compensation for this was made at a later stage in the process
by setting the median intelligence quotients for the age groups at
slightly over 100. Occupational representation was held proportional
to the distribution of occupational populations in the country
as a whole as shown in the United States census. Only American-born
white children were included in the standardization. This was
YEAR II
(6 tests count 1 month each or 4 tests count 1½ months each)

1. Place three small blocks into similar holes in a board.*
2. Point to toys when their names are given.
3. Point to parts of a large paper doll when parts are named.*
4. Build a four-cube tower after demonstration.
5. Name common objects shown in separate pictures.*
6. Use a two-word sentence spontaneously (e.g., See kitty).*
(Alternate: Obey simple commands to manipulate small toys.)


YEAR III
(6 tests count 1 month each or 4 tests count 1½ months each)

1. Obey simple commands to manipulate small toys.*
2. Name common objects from separate pictures.*
3. Point to the longer of two sticks.
4. Name at least three objects shown in one picture.
5. Point to objects to indicate use (e.g., Show me which one we drink
   out of).*
6. Tell what to do in common situations.*
(Alternate: Draw a cross with a pencil after demonstration.)


YEAR VI
(6 tests count 2 months each or 4 tests count 3 months each)

1. Define five words orally by description, use, or classification.*
2. Make a simple bead-chain pattern from memory after demonstration.*
3. Tell what part is missing from pictured objects.
4. Select certain numbers of blocks from a pile.*
5. Point to one of five pictured objects which is different from the
   rest.*
6. Draw a pencil line through a simple maze to make the shortest
   path.


YEAR X
(6 tests count 2 months each or 4 tests count 3 months each)

1. Define eleven words orally.*
2. Explain why the pictured actions of a person are foolish.
3. Read a passage of 48 words and recall from memory a considerable
   part of it.
4. Give two reasons in support of an oral statement.*
5. Name as many disconnected words as possible in a minute.*
6. Repeat six digits after one oral presentation.*


AVERAGE ADULT
(8 tests count 2 months each or 4 tests count 4 months each)

1. Define twenty words orally.*
2. Transcribe a short passage in a code which is exposed.*
3. Give differences between two abstract words.*
4. Read short arithmetic problems and answer without using paper
   and pencil.
5. Tell what proverbs mean in own language.
6. Give oral solution of a practical mechanical problem presented
   orally.*
7. After one oral presentation repeat 24-syllable sentence without
   error.
8. Tell in what way pairs of words are alike.


SUPERIOR ADULT
(6 tests count 6 months each or 4 tests count 9 months each)

1. Define 30 words orally.*
2. Read aloud a problem concerning direction and distance traveled
   and give answers without using paper and pencil.
3. Give opposites of words.*
4. Watch examiner fold and cut a piece of paper, then make a pencil
   drawing to show how paper would look unfolded.*
5. Read silently while examiner reads aloud a simple geometric
   progression problem, then give answer without using paper and
   pencil.*
6. Repeat 9 digits after one oral presentation.

* The asterisks indicate tests to be used as abbreviated scale.


FIG. 5. SAMPLE OF THE REVISED STANFORD-BINET TESTS OF
INTELLIGENCE
(Adapted from Terman and Merrill, 1937 b, after E. B. Greene)
a procedure of extraordinary thoroughness. To give some idea of
what was involved, the final selection of items required choice from
30,000 cards used for tabulating data on testing run in the various 
communities. It is interesting to compare this with the standardi- 
zation of the former revision made in 1916. Here the work was 
done with a group of about 1000 children which included all those 
of suitable ages attending a school in a typical middle-class com- 
munity, the school being the only one in the community and en- 
rolling all the children. This earlier standardization has been 
criticized as somewhat too high, because the group was drawn 
from a community in California, a state in which mean regional 
intelligence is above that of the nation as a whole. Such a choice 
would have the effect of making the norms of the Stanford Revi- 
sion somewhat unduly exacting, and of making obtained mental 
ages and intelligence quotients unduly low. 

The probable validity of the items finally selected for the meas- 
urement of intelligence was determined by the use of a variety of 
criteria (Terman and Merrill, 1937 b; McNemar, 1942). (a) 
First there was the general opinion as to their worth formed by 
the corps of expert workers who constructed the scale. An im- 
mense variety of items were discussed and analyzed, and only the 
best survived this preliminary selection. (b) A second criterion
was the increase in the percentages of children who succeeded
with each item at increasing ages. Only those were retained which
showed a rising gradient of successes at older ages. (c) A specially
devised discrimination quotient was worked out and applied, which
was based on the differences of the ages of those who passed and
those who failed each subtest. (d) The correlation of each subtest
with the composite total scores on the two forms L and M was
used as a selective criterion. Subtests that did not show a satisfactorily
high correlation were rejected. This, clearly, would tend
to make the scale a homogeneous instrument.
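
Criteria (b) and (d) lend themselves to a small illustration. The sketch below is not the procedure used by the constructors of the scale; the function names, the sample figures, and the cutoff of 0.4 for a "satisfactorily high" correlation are all assumptions made for the example. The discrimination quotient of criterion (c) is left out because its formula is not given here.

```python
from math import sqrt


def has_rising_gradient(percent_passing_by_age):
    """Criterion (b): the percent passing never falls with age and rises overall."""
    p = percent_passing_by_age
    never_falls = all(later >= earlier for earlier, later in zip(p, p[1:]))
    return never_falls and p[-1] > p[0]


def pearson(xs, ys):
    """Plain product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so no usable correlation
    return sxy / sqrt(sxx * syy)


def keep_item(percent_passing_by_age, item_passed, total_scores, min_r=0.4):
    """Apply criteria (b) and (d); min_r is a hypothetical cutoff."""
    correlates = pearson(item_passed, total_scores) >= min_r
    return has_rising_gradient(percent_passing_by_age) and correlates


# Invented data: percent passing at four successive ages, then six children's
# pass/fail record on the item (1 or 0) beside their composite scores.
rising = [10, 35, 60, 85]
passed = [0, 0, 1, 1, 1, 0]
totals = [20, 25, 40, 45, 50, 30]
print(keep_item(rising, passed, totals))
```

An item whose percent passing wavers from age to age, say 50, 48, 52, 49, is rejected on criterion (b) alone, however well it correlates with the total.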

As both Terman and McNemar point out, this account of the
processes of item selection is an effective answer to critics of the
scale who claim that the only criterion used was the increasing percentages
passing on each item at successive advancing ages. It is
hard to see how anything better could well be done to secure a
preliminary validation of the items to go into any test. The crucial
point, of course, is the judgment and choice of the workers who
made the test. They were governed by a certain working conception
of general intelligence and its manifestations; and if this
were at fault, everything would fail. For the other criteria are
internal and turn on the logic of the instrument itself, always
depending on the basic assumption that it really is a valid means
of revealing intelligence.

There were in addition certain secondary considerations which 
were kept in mind in the selection of items and subtests. The work- 
ers desired to construct a scale that would be as easy as possible 
to score. Other things being equal, they chose items which were 
interesting to children and which were of varied type. Economy of 
time, too, was a point considered. It was desired to keep the time 
limit for testing within 75 minutes for the older subjects and 
within 50 minutes for the younger ones.

As to the type of items which emerged finally from this extended
selective process, some idea may be formed from the partial synoptic
outline in Figure 5, but if possible the reader should examine
the scale itself. Many of the subtests involve the use of pictures,
objects, the ability to indicate parts of the body, and so forth.
These nonverbal and performance subtests occur with particular
frequency at the earlier ages. At the higher levels there is a larger
proportion involving abstract verbal and numerical processes.
Immediate memory for words is a type of subtest which appears
at many places in the scale with varying degrees of difficulty, of
course. The use of performance items meets to some extent the
criticism which was made regarding the earlier revision, to the
effect that it was unduly verbalistic. Indeed some critics seem to
feel that the present scale has gone too far in the opposite direction
rt, 1939).

C. Scaling and standardization. As has already been pointed
out, the Revised Stanford-Binet scale is an age scale. That is to
say, the tests are grouped in terms of age levels. In this respect
Terman has continued to follow the pattern set by Binet as long
ago as 1908. There has been considerable criticism of this decision
on the ground that improved modes of test construction which
have developed in recent years were ignored. We shall take up
these considerations later on.

The procedure for assigning the subtests to the proper age
classifications was as follows. The standardization group already
described in general yielded about 100 subjects at each half-year
interval from ages II through V, and about 200 subjects at each
year interval at ages VI through XIV. It has already been pointed
out that the term mental age may have two different meanings.
It may mean the chronological age at which a given score is the
mean score, or it may mean the average chronological age of all 
those making a given score. Putting it otherwise, one may take 
the scores on the test as the base line and refer the chronological 
age values to them, or one may take the chronological ages as the 
base line and refer the scores to them. In statistical terminology 
this again means that mental age may represent the regression of 
age on score, or the regression of score on age (v. Thurstone, 
1921; Otis, 1916). The choice of the workers in this case was to 
base mental age on the chronological age of those achieving a 
given test performance.* That is to say, they fitted test perform- 
ances to the chronological ages of those who made them. 
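
The distinction between the two regressions can be set out compactly. What follows is only a sketch assuming linear regression; the symbols (x for test score, a for chronological age, r for their correlation, and s_x and s_a for the two standard deviations) are introduced here for illustration and are not taken from the sources cited.

```latex
\begin{align*}
\text{score on age:}\quad & \hat{x} = \bar{x} + r\,\frac{s_x}{s_a}\,(a - \bar{a}) \\
\text{age on score:}\quad & \hat{a} = \bar{a} + r\,\frac{s_a}{s_x}\,(x - \bar{x}) \\
\text{first line solved for age:}\quad & a = \bar{a} + \frac{1}{r}\,\frac{s_a}{s_x}\,(\hat{x} - \bar{x})
\end{align*}
```

The second and third lines have the same slope only when r = 1/r, that is, when the correlation between score and age is perfect. Short of that, the age at which a given score is the average score is not the average age of those who make that score.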

This might have been done simply by assigning each subtest 
to the age level at which it was passed by 50% of the age group 
of standardization subjects. For instance, Subtest 3 at Age X re- 
quires the subject to read a passage of 48 words and then to recall 
a considerable amount of it. This might have been standardized 
at the ten-year-old level because approximately half of the 200 
children in the standardization group who were in the ten-year-old
category were able to do it. The procedure has been used elsewhere
in test construction, but it was not the one employed here.
The actual procedure was an elaborate cut-and-try, trial-and-error
job of manipulation in which subtests were experimentally shifted
from one age level to another until finally in each case the median
mental age of each age group would correspond to its median
chronological age.

As will be seen from Table 9, the actual percentages of the 
different age groups passing the subtests for each age level in the 
final layout are not constant. They range from 77% of the four 
year olds to 37.4% of superior adults. The workers who con- 
structed the scale had a number of reasons for arranging the sub- 
tests in what might seem a complicated and indirect manner. 
Decidedly their most important reason, however, and the one of 
interest here, was their belief that the real value of increases in 
mental age diminishes as age advances. That is, they thought of 
the process of mental growth as going very rapidly early in life 
and then slowing up, so that the true difference between a mental 
age of 13 and one of 14 would be less than that between 4 and 5,
let us say. Suppose that the true difference between mental ages
4 and 5 is 20 points, and can be expressed as the difference between
100 and 120 points. Suppose again that the true difference
between mental ages 13 and 14 is 10 points, and can be expressed
as the difference between 150 and 160 points. Then if the scale is
set up so that the subtests are exactly right for children 4 years

* The two methods of determining a mental age will give different values
except under one particular condition, i.e., when there is a perfect correlation
between test scores and chronological age.


TABLE 9

PERCENTS PASSING FOR VARIOUS AGE GROUPS, STANFORD REVISION OF
THE BINET SCALE

(Quoted from Terman, et al., 1917, Table 43, p. 158)

Year Group          Average Percent Passing

4                   77.0
5                   71.3
6                   70.8
7                   68.0
8                   63.2
9                   62.3
10                  64.5
12                  62.4
14                  55.6
Average Adult       59.8
Superior Adult      37.4

old and 13 years old, a smaller percentage of the 13 year olds
must pass them than of the 4 year olds, because the “ceiling,” or
the next classification, is much closer.

In actual practice, the tests were manipulated and fitted
into the scale so that the median intelligence quotient for each
age group came out at just over 100, this being done to
compensate for the inadequate sampling in the
standardization group. It should be noted here that this does not
mean a manipulation of the scale so that the obtained
intelligence quotient of any child would remain constant.
This criticism has been made from time to time, but it is
based on a misunderstanding. The purpose of the procedure was
to insure that the median intelligence quotient of approximately
100 should always have the same meaning at all age levels in
terms of test performance. This is absolutely necessary if the
intelligence quotient is to be used at all, for it cannot stand for
one set of relationships and performances at one level and some- 
thing different elsewhere. But whether the actual intelligence quo- 
tient of any child will vary or remain constant remains an open 
question. When a foot rule, which is a physical scale, is set up so 
that every inch is equal to every other inch, this does not mean 
that every object measured will always keep the same length under 
all circumstances, but merely that the measurements themselves 
will always have a constant meaning. 

D. Administration and scoring.* The indicated way of using the
scale is to begin well below the probable mental level of the child
with whom one is dealing. Specifically the testing should start low
enough on the scale so that the subject will achieve 6 consecutive
successes on the subtests, and it should be continued not to the
point of the first failure, but until 6 consecutive failures have
occurred. Thus the child does not earn a given rating solely by
passing the tests at the given age level but by certain successes at that
level, plus successes in preceding years, minus certain failures
above the given age level. A clear grasp of this scoring procedure
will resolve certain not infrequent misunderstandings in regard
to mental age, which is often supposed to mean a very clear-cut
success with all tests on a given age level, with no failures at or
below it and no successes above. It is the performance within this
range from the starting point to the highest point achieved that
provides the profiles that were discussed in the last chapter.

The scale yields two chief types of scores. (a) First there is the 
mental age, expressed in years and months, and commonly re- 
ported in the symbolism 8-5 (eight years and five months), etc. 
This is an indication of mental maturity. Readiness for first grade 
entry, for example, would be considered to depend on the child’s 
mental age in the first instance, rather than on his intelligence 
quotient, in so far as it is determined by measurable mentality. 
(b) Then there is the intelligence quotient, the basic idea of which 
is that age must be considered if one wishes to form an opinion of 
a child’s brightness. This is a very common-sense notion, as every 
parent who has boasted of the wonders his young hopeful can perform
at a very early age should readily understand. All that is
done by the intelligence quotient is to transform such evaluations
into numerical scores by dividing the child’s mental age by his
chronological age.

* See Terman and Merrill, 1937 b. Terman and Merrill, 1937 a, is a more concise
account of administrative and scoring practice.
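
The arithmetic of the quotient is easily illustrated. The sketch below assumes ages written in the years-and-months symbolism mentioned above (8-5 for eight years and five months) and multiplies the ratio by 100, as is customary, so that a child whose mental and chronological ages agree scores 100. The function names are hypothetical, and the sketch applies only to chronological ages below 13, where no adjustment of the denominator is involved.

```python
def parse_age(age):
    """Convert an age written as 'years-months' (e.g. '8-5') into months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def intelligence_quotient(mental_age, chronological_age):
    """Ratio of mental age to chronological age, multiplied by 100."""
    return 100 * parse_age(mental_age) / parse_age(chronological_age)


# A child of exactly seven years with a mental age of 8-5.
print(round(intelligence_quotient("8-5", "7-0")))
```

The same mental age of 8-5 earned by a child of ten would give a quotient well below 100, which is the point of taking age into account.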

A special problem arises at the upper levels in connection with 
these age scores for the reason that Terman and his fellow workers 
found that performance on the subtests ceased to improve regularly
beyond a certain age level. Thus there seems to be a point,
or age of arrest, beyond which mental age does not increase although
of course chronological age increases just as before. Since
a mental age up to 14 means specifically the representative mental
test performance of a given chronological age group, and since
the intelligence quotient is derived from the mental age and depends
upon it for its meaningfulness, the difficulty is plain. In
the earlier revision 16 was decided upon as the age of arrest, and
the intelligence quotients of all persons of 16 and over were calculated
on this figure as the denominator. This practice was admittedly
somewhat arbitrary and open to a number of questions,
and in the revised scale it has been considerably modified. In computing
ages for the sake of working out intelligence quotients, all
chronological ages from 13 to 16 are scaled down by 1/3 of the
excess over 13. Thus a chronological age of 14 years is figured
at 164 months instead of its normal full value. This is arrived at
by taking 156 months, which is the full value for 13 years, adding
12 months for the succeeding year, and subtracting 1/3 of 12 according
to formula, which is 4 months, giving a total of 164
months. So again a chronological age of 15 years is figured at 172
months. In computing adult intelligence quotients, i.e.,
beyond the chronological age of 16, the base line or denominator
Le ise solution, and in a 
2. Special Problems: Criticisms: Appraisals


The Stanford-Binet scale has such a focal position that a just 
appraisal of it is of great importance. It has constantly been used 
as a reference point in the construction and validation of other 
psychometric instruments. The methods used in its construction
have been widely copied. And there is hardly a question connected 
with mental testing and its outcome upon which the results and 
principles of the Binet scale do not have some bearing. 

A. Problems connected with scoring. (a) Both the M.A. and 
the I.Q. have double meanings. We have seen that mental age is 
conceived as the representative test performance of a given standardization
group. This interpretation is consistently followed up
to C.A. 13, and almost so up to C.A. 16. On the plan described
above for upper age levels, C.A. 16 is figured as 15, i.e., 13 years
plus 2/3 of (16 − 13), or 15 years. Up to this point, then, there is a direct
coincidence between test performance and chronological age
grouping, but not beyond. So an assumption was needed. This
was that the distribution of adult I.Q.’s would be the same as those
from ages 5 to 10, and the tests for these upper levels were assigned
and scaled to bring this about. Clearly then all mental ages for
persons above this dividing line, and consequently all intelligence
quotients derived from them, are hypothetical, inferential, and 
consequently open to doubt. A great many workers in applied psy- 
chology have recommended that the intelligence quotient as re- 
vealed by this particular scale and its techniques should not be 
used at all for persons above the age of 16. Another way of look- 
ing at the same difficulty is to say that the age of arrest fixed at 
16 is extremely hypothetical and dubious, and probably a function 
of the scale rather than a phenomenon of mental growth. When
the intelligence quotient of a person 20 or 30 years old is calculated 
as if he were 16 years old, something quite obviously questionable 
is involved. Symonds (1927), for instance, recommends that in 
dealing with individuals in senior high school and beyond, intelli- 
gence quotients should not be figured. Instead, the test perform- 
ance of any such person should be compared with that of the 
group in terms of which he is to be rated by means of percentile
rankings, or standard scores, or some similar statistical device. 
And the suggestion has been frequently repeated and put into 
effect. 
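
The scaling of chronological age on which this difficulty turns can be sketched in a few lines. The sketch follows one reading of the rule as stated: ages up to 13 count in full, one third of the excess over 13 is subtracted between 13 and 16, and every older person is treated as if he were 16. The function name and the use of months throughout are choices made for the illustration.

```python
THIRTEEN_YEARS = 13 * 12  # 156 months
SIXTEEN_YEARS = 16 * 12   # 192 months


def adjusted_ca_months(ca_months):
    """Chronological age in months as figured for the I.Q. denominator."""
    if ca_months <= THIRTEEN_YEARS:
        return float(ca_months)  # counted at full value
    # Persons beyond 16 are figured as if they were 16.
    capped = min(ca_months, SIXTEEN_YEARS)
    excess = capped - THIRTEEN_YEARS
    # Only two thirds of the excess over 13 years is allowed to count.
    return THIRTEEN_YEARS + excess - excess / 3


for years in (12, 14, 15, 16, 30):
    print(years, adjusted_ca_months(years * 12))
```

On this reading a person of 30 and a person of 16 are both divided by 180 months, or 15 years, which is exactly the feature the critics find questionable.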

(b) A grave question has been raised as to the statistical stability
of the intelligence quotient. We have seen that the scale is
set up and the subtests placed in such a way that an I.Q. of 100
always has the same meaning at all ages. What it always indicates
is the median test performance of the age group concerned. This
being given the mental age value of the median chronological age
of the group, it always yields an intelligence quotient of 100.*
But this does not in any way prove that I.Q.’s other than 100,
I.Q.’s of 120 or 80, let us say, will also have the same meaning at
all age levels. What if an I.Q. of 120 is 1 standard deviation above
the median at the age of 6, and 2 standard deviations above it at
the age of 12? Far more people would make this I.Q. score at 6
than at 12. Or putting the case in other words, an I.Q. of 120 at 6
would be easier to get and would mean less actual intelligence than
what would seem to be the identical I.Q. at 12. To be sure, such
extreme fluctuations as that in our hypothetical illustration do
not occur. But the intelligence quotient is by no means entirely
stable at different ages. This is shown from the data in Table 10.
The standard deviation of the I.Q.’s of the age groups of the standardization
group on Form L of the scale is 20 at the age of 12, and
12.5 at the age of 6. So an I.Q. 2 standard deviations below the
median at 6 would work out at 100 − 2 × 12.5 = 75, whereas at 12
it would work out at 100 − 2 × 20 = 60. The same relative performance,
that is, would receive very different ratings. A rough
graphic representation of what this analysis means is presented
in Figure 6. It shows what happens to three intelligence quotients
which are respectively 1, 2, and 4 standard deviations above the
mean at the age of 2. Reading from Table 10, we see that the
standard deviation for age 2 is 16.7 on Form L. So the three I.Q.’s
would work out at 117, 133, and 167 (approximately). The
fluctuations are shown by the three curves drawn through the
points plotted 1, 2, and 4 standard deviations above the mean at
successive 1 year age levels, according to the data given in
Table 10. It is quite clear that there is a considerable fluctuation,
and that it is more and more marked the more extreme the I.Q.
Further consideration of this highly important topic is postponed
until scores alternative to the intelligence quotient are discussed.
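
The computations in this passage can be reproduced directly. The sketch below uses only the three Form L standard deviations cited in the text (16.7 at age 2, 12.5 at age 6, 20 at age 12) and takes the median as exactly 100, which is a simplification, since the medians were in fact set slightly higher. The names are chosen for the illustration.

```python
# Form L standard deviations of I.Q. at the three ages cited in the text.
FORM_L_SD = {2: 16.7, 6: 12.5, 12: 20.0}


def iq_at(k_sd, age, median=100.0):
    """I.Q. lying k_sd standard deviations from the median at a given age."""
    return median + k_sd * FORM_L_SD[age]


# The same relative standing, 2 standard deviations below the median.
print(iq_at(-2, 6), iq_at(-2, 12))

# The three quotients traced in Figure 6, taken at age 2.
print([round(iq_at(k, 2)) for k in (1, 2, 4)])
```

A child who held the same relative standing from age 6 to age 12 would thus see his quotient move by fifteen points with no change in his position among his fellows.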
TABLE 10

STANDARD DEVIATIONS OF I.Q.’S OF AGE GROUPS ON REVISED
STANFORD-BINET SCALE

(Quoted from Terman and Merrill, 1937, Table 7, p. 40)

Chronological   Numbers in    Standard Deviations    Standard Deviations
Age             Age Groups    of I.Q.’s on Form L    of I.Q.’s on Form M

2               102           16.7                   15.5
2½              102           20.6                   20.7
3                99           19.0                   15.7
3½              103           17.3                   16.3
4               105           16.9                   15.6
4½              101           16.2                   15.3
5               109           14.2                   14.1
5½              110           14.3                   14.0
6               203           12.5                   13.2
7               202           16.2                   15.6
8               203           15.8                   15.5
9               204           16.4                   16.7
10              201           16.5                   15.9
11              204           18.0                   17.3
12              202           20.0                   19.5
13              204           17.9                   17.8
14              202           16.5                   16.7
15              107           19.0                   19.3
16              102           16.5                   17.4
17              109           14.5                   14.3
18              101           17.2                   16.6


FIG. 6. FLUCTUATIONS IN THREE INTELLIGENCE QUOTIENTS AT
VARIOUS AGES

Reliability varies with the size of the I.Q. For chronological ages from
6 to 13, the error of measurement ranges from a lower value for low I.Q.’s
to 5.3 for high I.Q.’s. The corresponding deduced reliability coefficients
are .97 and .90. Thus general reliability coefficients cannot
be worked out for this scale, or for any such scales. The
reasonable expectation would be as follows. Suppose three persons
each with a mental age of 100 months, and that the error of measurement
at this M.A. level is 4.2. Then if one of them had a C.A.
of 100 months, and thus an I.Q. of 100, this obtained score would
mean an I.Q. of 100 ± 4.2. If the second were C.A. 80, this would
give him an obtained I.Q. of 125, which would indicate a true I.Q.
of 125 ± 5.2. If the third had a C.A. of 125 months, this would give
him an obtained I.Q. of 80, and would indicate a true I.Q. of
80 ± 3.4 (McNemar, 1942).


B. The vocabulary test. 


This is another major center of controversy. The vocabulary
test has a very important place in the scale, and appears at many
age levels. Far-reaching issues are involved in it. The test itself
consists of a list of 100 stimulus words selected by arbitrary rule
from a small standard English dictionary containing in all 18,000
words. The purpose of this method of choosing the words was to
secure an unbiased sample considered to be of sufficient size and
at the same time workable in practice. The words are administered
orally to the subject, who makes an oral response on which he
is rated.

It has been criticized on many grounds. (a) The selection of
words is said to be meager and arbitrary and insufficient to afford
any true indication of the total vocabulary of the subject (Kent,
1937). To get an estimate of the person’s total working vocabulary
is of course the whole purpose. Thus if a subject makes 20 correct
definitions, he is credited with a total vocabulary of 3600 words
because the sample of 100 to which he responds is selected from a
total list of 18,000. So again a 10 word correct response would
indicate a total vocabulary for the subject of 1800 words. There
is no doubt that this seems to involve a great deal of confidence
in a procedure which perhaps follows the strict logic of statistics
but leaves very much to chance. (b) Clinical subjects in disturbed
mental states are apt to find the test difficult and upsetting,
because they make a great many frustrating failures (Kent, 1937).
(c) The scoring is unsatisfactory. Subjects may and do try to define
the stimulus words in an enormous and varied number of ways. All
kinds of shades of meaning and delicacies of interpretation arise.
The instructions for rating and scoring the subjects’ responses as
given in the manuals (v. Terman and Merrill, 1937 a and b) are
very full and careful. But even so the bias and judgment of the
examiner cannot help but play a dangerously large part. It need
hardly be said that a vocabulary test of this kind cannot possibly
be scored by a key that assigns only one right response, or a very
few definite alternatives, for each stimulus word. (d) A person’s
vocabulary is said to depend very largely upon opportunity and
environment. This is a point often made in criticism, particularly
in connection with the use of the test with children from
non-English-speaking homes. There is at least some direct evidence on
the matter, for in one investigation a list of words was presented
day after day without comment to about 45 adults, by writing
them daily on the blackboard before the beginning of a class, and
it was found that the subjects made marked gains in a short time
from this casual and more or less artificial “exposure” (v.
Haefner). (e) The implications of the test are said to be anomalous.
On the one hand, the assumption is that intelligence stops
advancing about the age of 15; on the other, it is known that adults
tend to gain in command of vocabulary (Christian and Paterson). How
then, it is asked, can the range of vocabulary be a true test of
intelligence if this also is true? (Stoddard, 1943). (f) Lastly, the
vocabulary test has been criticized as a sort of “quiz kid” test.
Stoddard (1943) has expressed himself very emphatically on this
point, remarking that the real index of intelligence is not to be
able to answer a large number of superficial questions, but to deal
with problems of increasing difficulty, complexity, and
abstractness, which involve higher and higher levels of mental
organization.
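
The scaling in criticism (a) is a simple proportion, and a rough sketch shows how much is left "to chance." The binomial standard error below is an illustration added here, on the assumption that the 100 words behave like a simple random sample of the 18,000; it is not a calculation made by Kent or by Terman and Merrill, and the function names are hypothetical.

```python
import math

DICTIONARY_SIZE = 18_000  # words in the source dictionary
SAMPLE_SIZE = 100         # stimulus words in the vocabulary test


def estimated_vocabulary(correct):
    """Scale the number of correct definitions up to the whole dictionary."""
    if not 0 <= correct <= SAMPLE_SIZE:
        raise ValueError("correct must lie between 0 and 100")
    return correct * DICTIONARY_SIZE // SAMPLE_SIZE


def sampling_error(correct):
    """Binomial standard error of the estimate, in words.

    Assumption of this sketch: the 100 stimulus words act as a simple
    random sample of the dictionary.
    """
    p = correct / SAMPLE_SIZE
    return DICTIONARY_SIZE * math.sqrt(p * (1 - p) / SAMPLE_SIZE)


if __name__ == "__main__":
    for correct in (10, 20):
        print(correct, estimated_vocabulary(correct),
              round(sampling_error(correct)))
```

Under that assumption a subject credited with 3600 words carries a standard error of roughly 720 words, which gives a sense of the scale of the uncertainty the critics have in mind.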

Nevertheless Terman and his associates have repeatedly expressed
great confidence in it and have brought forward evidence that is,
up to a point, convincing. In fact they claim that it is by far the
most valuable test in the scale, and with children of
English-speaking parents probably as good as any other three tests
(Terman, 1916). (a) In many cases it determines the I.Q. of a
subject within 10% of that obtained by the use of the whole scale.
(b) With 100 children from English-speaking families, no case was
above age mentally with a significantly low vocabulary, and no case
was below age mentally with a significantly high one. (c) Its
correlation with the whole scale ranges from .65 to .91 within
single age groups, i.e., where the spread of ability is limited to
one year C.A. and so would tend to lower all correlations (Terman
and Merrill, 1937 b). Its correlation with M.A.’s obtained by the
whole scale in the case of 631 children extending over a wide age
range was .91 (Terman, 1918). (d) Even with children from homes
where a foreign language is used the test is a [...] one; for while
children from Portuguese- and Italian-speaking homes were below the
norms for American children in the first [...] years of school, the
difference had vanished by the time a mental age of 12 had been
reached. The inference is that the acquisition of vocabulary is
highly indicative and that bright persons will pick it up very fast
when opportunity offers, whereas dull ones will not (Terman, 1918).
(e) Vocabulary tests, like the Stanford-Binet scale as a whole, are
very reliable. Six such tests of 60 words each were given to 32
children, and yielded intercorrelations of .92 to .98 with a mean
intercorrelation of .94. (f) Growth in vocabulary is very regular
as mentality develops. The medians for successive ages advance
almost in a straight line. Thus the test is admirable for quick and
valid surveys of mentality (Terman, 1918; Terman and Merrill,
1937 b).

measure the intelligence of individual persons, like any other. We
shall return to this consideration in another connection, as it is one 
aspect of a broad issue in test construction. (b) An age scale of 
this type is criticized as very rigid and as wasteful of time and 
material. Administration from the point of 6 consecutive successes 
to the point of 6 consecutive failures wastes a great deal of time on 
the eliciting of responses that have no bearing on the intelligence 
score. Many of the subtests which are assigned to a single level 
lend themselves to graded use over a wide range of mental ages, 
with appropriate changes in the expected scores. If this had been 
done in the construction of the instrument, it would have been 
much more economical of material, and also much more flexible, for 
as it now stands it must be used in its entirety according to the 
instructions, or not at all. The argument here is in favor of a point 
scale as contrasted with an age scale, i.e., a scale in which each 
subtest contributes to a total numerical score, instead of one where 
certain subtests in a rigid order must be passed in order to estab- 
lish a given mental age. It is interesting to note that in an experi- 
mental situation the Revised Stanford-Binet scale has been trans- 
formed into a point scale, and when it was applied to 44 subjects 
ranging from 8 to 18 years of age, it yielded results much like the 
original instrument, with a saving of about one-third of the time 
(Growdon).

To these criticisms of the practical limitations of the instrument
McNemar (1942) has replied that the “juggling” of items and
subtests to equate mental and chronological age did indeed take
place, and that rigidity and wastefulness are indeed involved. But
he remarks that even the scores obtained on point scales are very
often if not usually transformed into mental age values, and that
these are found very intelligible.* It must be remarked that this
rebuttal is by no means wholly satisfactory, and that it hardly
dispels the criticisms of an age scale to say that point scores also
can be read as age norms. The issue of the objection is that point
scores are more economical to obtain and that a test set up to
yield them is more flexible. As to Growdon’s reported transformation
of the Stanford-Binet scale into a point scale, and of the
advantages that accrued, it must be remarked that practically all
his claims could be made of the vocabulary test alone, which can
yield point values for various ages, gives results correlating closely
to those of the whole scale, and takes much less time.

* For one typical instance of the conversion of point scores into
mental age norms see Table 8, which shows the transformation in the
case of the Otis Self-Administering Test of Mental Ability.


D. Composite character of the scale. 


The composite character of the scale has been assailed (Wells, 
1938; Stoddard, 1943). Often the expressions which the critics 
allow themselves to use in this connection are hardly less than 
abusive. It has been called “a hodgepodge,” a “motley collection 
of tasks,” and so on. And it has been likened to a refurbished 
Model T Ford, this being a reference to its composite character
which Terman took over from Binet and embodied in both his
revisions. It may very well be that a composite instrument con- 
taining a large variety of types of items and subtests is the logical 
translation into practice of the vague and inclusive yet significant 
concept of general intelligence. 

Still the question of what this “composite” actually measures is 
an entirely fair one. McNemar (1942), using the techniques of 
factor analysis, has shown that in spite of the wide variety of sub- 
tests and items, the scale taken as a whole seems to measure in the 
main one thing; to wit, a “general factor.” But it is legitimate to 
inquire further as to the nature and meaning of this “one thing.” 

Before turning to this, a word is in order here as to the kind of
substitute which some of the critics would presumably recommend.
What many of them would have in mind is an array of tests dealing
with much more sharply defined mental processes and functions,
such as verbal ability, numerical ability, inductive reasoning,
deductive reasoning, and the like. Such clearly defined entities
are among the outcomes of a certain type of factor analysis.
McNemar answers very truly that such sharply defined “purified”
tests are likely to be much less valuable clinically than composites
which elicit performance in many different situations and in terms
of responses to a varied array of items. However, since tests of this
latter kind have only just begun coming into existence, there is
not experience enough as yet to determine their clinical
possibilities, particularly when a sufficiently varied battery of
them is utilized. And as we have seen, the Stanford-Binet scale has
not recommended itself very highly for clinical diagnosis and the
indication of differences in kind and quality within the total
pattern of intelligence. A further reply to the alternative
suggested by some of the critics is that such “purified” tests as
have been made and tried out have not yet proved themselves
distinctively superior [...]
to whether alleged better tests (narrowed, purified, centered more
closely and explicitly on sharper concepts) will reveal mental
processes better, only time can tell.


MODIFICATION OF BINET’S PRACTICES: POINT SCALES


Even before the first Standard Revision appeared, certain quite 
different developments, based on the work of Binet, were in the 
making. The changes contemplated were the dropping of the age- 
wise organization, and the use of the same subtest in many in- 
stances at different age levels with different scoring standards. 
The result would be a point scale rather than an age scale, i.e., a
scale yielding a point score or scores rather than a mental age,
although the point score might be transformed into an equivalent
M.A. The Point Scale for the Measurement of Intelligence by
Yerkes and Bridges (1916; revised 1923) is often considered the
pioneer instrument of this type. It seems better, however, to select
as an illustration a scale still in current use. 


1. Herring Revision of the Binet-Simon Tests *


This, like some other point scales, is essentially a modification
of the ideas and practices of Binet, rather than a radical departure
from them. The synoptic outline is shown in Figure 7. A comparison
with the Stanford Revisions will be found instructive, for it
embodies many of the changes which are being recommended at
the present day, though in this particular early instance they [...]


GROUP A

 1. Tell me what you see in this picture. (4 pictures presented in
    series)
 2. In the first row of numbers tell me what two numbers should
    come next ____ ____ (here and here). Go ahead. (8 such rows
    for number completion)
 3. Read this to yourself. Then begin at the beginning and tell me
    everything you have read. (a passage with 13 “memories”)
 4. I am going to say some numbers. When I am through, say the
    numbers backwards. (digit groups ranging from 2 to 9
    numbers)

GROUP B

 5. Showing knees, fingers, ear, foot.
 6. Repetition of 6- and 7-syllable sentences.
 7. Pointing out the larger figure in each of 3 pairs of figures.
 8. Aesthetic discrimination, 4 pairs of faces.
 9. Naming black, gray, white.
10. Giving the solution to 6 problem situations.
11. Reproduction of thought. 17 “memories.”
12. Definition of 7 abstract words.
13. Reproduction of thought. 12 “memories,” very difficult reading.

GROUP C

14. Give solutions to 5 problem situations.
15. Detect absurdities in 8 statements.
16. Building 4 sentences of 3 words each.
17. Giving rhyme for 4 words.
18. Picking out similarities in 6 groups of 3 things each.
19. Interpretation of 5 proverbs.
20. Reproduction of thought. 13 “memories,” rather difficult
    reading.
21. Read 3 scrambled sentences.
22. Solve 3 arithmetical problems.

GROUP D

23. Repeat 4 sentences of 10 to 13 syllables.
24. Directions test.
25. Directions test.
26. Similarities between 4 groups of 3 things each.
27. Generalize from 4 separate but related statements.
28. Comprehension of 2 verse passages.
29. Sentence completion.
30. Problem reading and solving.

GROUP E

31. Name 5 familiar objects.
32. Comparison of forms.
33. Perform 3 commands.
34. Diagram problem solving.
35. Forward repetition of digits, 2-10 in various series.
36. Repetition of 3 sentences of 19 to 24 syllables.
37. Detection of proportional relationships from material read.
38. Code writing.


FIG. 7. HERRING REVISION OF THE BINET-SIMON SCALE.
SYNOPTIC OUTLINE


* References: Herring, 1922, 1923.


One notices a considerable similarity in the material, for many 
of the subtests are derived or revised from the original Binet items. 
But there is a fundamental change in their arrangement and treat- 
ment. As will be seen, they are subdivided into five groups. When 
the test is given to a person, the groups are treated as cumulative, 
i.e., each one consists of the preceding plus the new material. At 
the end of each group there are instructions to omit various tests 
in the new material not yet taken up if it is evident that the child
will pass or fail with it. This shortens the total time needed for 
testing, but that time is still very long, and the instrument is 
cumbersome. The material, moreover, is highly verbal, this corre- 


scales. But certain advantages do seem clear. A point scale, if 
properly constructed, can be flexible in the sense that it need not 
be used in its entirety, since certain subtests can be picked out at
the discretion of the examiner and still yield significant scores in
which the meaning is clear from the standardization. The use of 
the same subtests at different ages economizes material and time. 
It lends itself very well to statistical analysis. And there is much 
force in the claim that point scale subtests can be selected with 
reference to a given psychological function rather than in terms of 
the relationship of success to age. Many clinicians and applied 
psychologists have expressed regret that Terman did not convert
the Stanford-Binet scale into a point scale in the second revision. 


    Age-Scale Characteristics             Point-Scale Characteristics

 1. Tests organized by years or other     Single homogeneous graded scale.
    age-units.
 2. Tests and items selected by           Tests selected in terms of the
    relationship of success to age.       function to be measured.
 3. Varied, unrelated ungraded tests      Each test so graded as to be
    in a composite.                       available over a wide range of
                                          ages.
 4. Internally standardized and           Standardized against external
    inflexible.                           criteria and flexible.
 5. All-or-none ratings of subject’s      More-or-less ratings of subject’s
    responses.                            responses.
 6. Qualitative.                          Quantitative.
 7. Measurements not fully amenable       Measurements wholly amenable
    to statistical treatment.             to statistical treatment.
 8. Tests weighted equally.               Tests weighted unequally.
 9. Implicit assumption that of new       Implicit assumption that of
    appearing or emerging functions.      continuously developing functions.
10. Measurements for different ages       Measurements for different ages
    relatively incommensurable.           comparable and commensurable.


FIG. 8. CONTRASTING CHARACTERISTICS OF AGE SCALES AND POINT SCALES
(Adapted from Yerkes and Foster q.v.)


MODIFICATIONS OF BINET’S PRACTICES: THE WORK
OF KUHLMANN


Kuhlmann’s name is associated with the best work in elaborating
and modifying the practices and ideas of Binet without going to 
the length of radical revision. 


1. Kuhlmann-Binet Scale * 

Before coming to his chief present-day contribution, mention
must be made of his two important revisions of the Binet scale,
published in 1912 and 1922. The 1912 revision adhered fairly
closely to the original; but in the 1922 revision, which was the
result of seven years’ work, considerable and important departures
occurred, among them the elimination of 19 of the original
subtests, an increase in the total number of subtests to 129 with 8
in each age group above 2 years, extension downward to the age of
3 months, and credit for speed as well as accuracy. The subtests
for very early ages included carrying an object to the mouth, and
binocular coordination determined by fixation on a moving object
(3 months); opposing the thumb in grasping, and reaching for
objects seen (6 months); initiation of speech sounds, such as
mama, baba, dada (1 year); imitation of simple movements made
by the experimenter (2 years). Arthur (1939) in an investigation
using 200 subjects found that the Kuhlmann-Binet agreed more
closely with Stanford-Binet results than did a Stanford-Binet
retest. This instrument is mentioned because it has great practical
and theoretical importance as a landmark in psychometric
practice. But in the next of his contributions to be discussed,
Kuhlmann, though using many of the ideas embodied in it, definitely
carried revision a step further.

* Reference: Kuhlmann, 1922.


2. Tests of Mental Development (Kuhlmann) †


The Tests of Mental Development embodies many ideas and
practices that are of great interest. A synopsis of representative
selections appears in Figure 9. Reference to it will help to make 
the description and analysis easier to follow. 


1. Characteristics


months upwards. 

It consists of 89 subtests and 19 supplementary subtests, some
from the original material of Binet, some from his own revisions 
of the Binet scale, and some new material of his own. 

In the preliminary work of assembling and trying out the
subtests, 121 tests were administered to about 15,000 persons from
[...] months to 60 years old. The choice of suitable tests for the
battery was made on the following criteria. (a) Preference was
given to those which showed large increases in raw scores in the
case of tests with several elements which were to appear at several
levels. Several of these are shown in Figure 9. (b) Preference was
given to those which yielded large increases in the percentages of
subjects who passed from age to age in the case of tests which were
either passed or failed. Several of these also are shown in Figure 9.
(c) Tests were selected in order to give a wide range of raw scores
in any single age group. This is contrary to usual standards in test
construction. Ordinarily it is thought desirable to have a minimum
variability within an age group, but Kuhlmann disagrees with
this criterion. His view is that tests which show a wide variability
are desirable because they will show qualitative differences more
clearly and thus provide a better diagnostic instrument. (d)
Preference was given to those tests which showed high correlations
with total scores on the entire battery. (e) It was thought
desirable to include tests with a wide variety of make-up. (f)
Preference was given to tests which were as free as possible from
the effects of coaching, practice, and variable training (Kuhlmann,
1939).

† Reference: Kuhlmann, 1939.

B. Standardization and scaling. A standardization group of 
about 3,000 was used, yielding about 106 subjects for each age 
group in the preschool range, and about 140 each year for school 
children. The tests, however, were not organized into age groups
or levels as with the Stanford Revisions, but into a scale of
subtests of increasing difficulty.

The basis of the scaling is novel and distinctive. It depends upon 
the alleged curve of mental growth developed by Heinis (q.v.).
Heinis derived what he considered a standard curve of mental 
growth from various test data on populations ranging in age from 
2 to 18 years. He believed it to represent the true course of human 
mental growth. And he reduced the growth curve to a mathemati- 
cal formula. For instance, his formula indicates 16 degrees of 
mental development between the ages of 16 and 20, 15 units or 
degrees between 20 and 30, and 3 between 30 and 40. The number 
of degrees of mental growth is much larger between earlier pairs of 
ages. This, as will be seen, corresponds in a general way to Ter- 
man’s view that mental age steps decrease in real value as age 
advances. But Heinis went much further than this. He claimed 
that his curve and formula reveal the real course of normal 
development. Kuhlmann has adopted this view, and the tests in 
this instrument were arranged to yield passing scores at every 3
points of true development as shown on the Heinis curve. The
mental age values which are assigned to the subtests, as is
illustrated once more in Figure 9, were worked out by placing the
subtest in the age group where it is passed by 50% of the subjects.
Thus Kuhlmann rejected the rather elaborate trial-and-error
manipulation and adjustment of subtests which was used in
assigning them to the appropriate ages in the Stanford-Binet scale.


 1.* 0-4 (21) SITTING WITH SUPPORT
     Child is placed in chair supported by back.
     Passed at 21 if he sits up for thirty seconds.

25.  1-8 (93) NAMING OBJECTS SHOWN
     Child is shown five objects and asked to name them.
     M.U.    93    105
     R.       2      5

35.  2-6 (135) NAMING OBJECTS FROM MEMORY
     Child is shown two objects and asked to name them, then
     told to shut his eyes and name them from memory.
     M.U.   135    153
     R.       1      2

44.  3-7 (177) RECOGNITION OF MISSING PARTS IN PICTURES
     Child is shown series of pictures with missing parts, and
     asked to name them.
     M.U.   177    210
     R.       2      4

53.  5-1 (228) REPETITION OF NUMERALS
     Repetition of digits given orally, six sets in all.
     M.U.   228    252
     R.       2      3

61.  6-2 (258) SIZE OF VOCABULARY
     Telling meanings of twenty-five words.
     M.U.   258    297    321
     R.       5      8     12

72.  8-8 (312) FINDING WORD AMONG SIX THAT COMPLETES AN ANALOGY
     A pair of words which establish a relationship, task being
     to choose among six given words the one that establishes
     same relationship to a third word, e.g. finger is to hand as
     toe is to ____.
     M.U.   312    345    378
     R/T   .05   .079   .105

89.  12-6 (363) DRAWING UPRIGHT FORMS IN INVERTED POSITION
     Series of designs presented on cards, with instructions to
     draw them as they would appear in inverted position.
     M.U.   363    396    429    462    495    528
     R/T   .012   .032   .052   .072   .092   .112


FIG. 9. SAMPLE SUBTESTS FROM TESTS OF MENTAL DEVELOPMENT.
(SCALE CONTAINS 89 SUBTESTS IN ALL)

(Kuhlmann, 1939)
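
The 50% placement rule can be sketched in a few lines. The pass rates below are invented for illustration, and the linear interpolation between neighbouring age groups is an assumption of this sketch; the text does not say how Kuhlmann handled ages falling between two tested groups.

```python
def age_of_half_passing(pass_rates):
    """Age (in months) at which the percentage passing first reaches 50.

    pass_rates: list of (age_in_months, percent_passing) pairs, sorted by
    age. Linear interpolation between neighbouring ages is an assumption
    of this sketch. Returns None if 50 percent is never reached.
    """
    if not pass_rates:
        return None
    first_age, first_pct = pass_rates[0]
    if first_pct >= 50:
        return float(first_age)
    for (age0, pct0), (age1, pct1) in zip(pass_rates, pass_rates[1:]):
        if pct0 < 50 <= pct1:
            return age0 + (50 - pct0) * (age1 - age0) / (pct1 - pct0)
    return None


if __name__ == "__main__":
    # Hypothetical pass rates for one subtest at four age groups.
    hypothetical = [(12, 10), (18, 30), (24, 70), (30, 90)]
    print(age_of_half_passing(hypothetical))
```

A subtest with these invented pass rates would be placed at about 21 months, the point where half the subjects succeed.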

C. Scoring. The scoring is unique and somewhat intricate (too
much so, in fact, for a full account to be repaying here), although
the claim is made that when a person becomes accustomed to it he
finds it feasible enough. For the easier tests, up to number 63, the
score is simply the number right. Above that level a combination
of speed and accuracy is used, the tests being timed to about 2
minutes and the speed scored by dividing the time the subject
takes to complete the test by the number of seconds. In order to
penalize the inaccurate subject the multiplication of speed by
accuracy is resorted to. Kuhlmann’s general reason for his unusual
emphasis upon speed and timing is that while in itself it is not
very significant, it becomes so in connection with exacting
problematic tasks. The obtained raw scores resulting from these
scoring procedures are converted in terms of the Heinis curve into
mental units, symbolized as M.U., which show the developmental value
and meaning of the subject’s test performance. Since on the basis
of the Heinis curve we know what the developmental status or
M.U. score or average for any given chronological age is, and since
the test gives the subject’s rating in terms of his actual M.U. score,
it is possible to know his status with reference to average or
expected mental development. So there is derived a score which
Kuhlmann calls the Percent of Average, or P.A., which expresses
the percentage of average or expected mental development for his
age which any given person manifests. Kuhlmann regards this
score as superior to and more meaningful than the I.Q. But as will
be clear from an examination of Figure 9, and from the above
explanation, the instrument can also yield mental ages and
intelligence quotients.

* The various symbols which appear in connection with these tests require
explanation. They are included here to give the reader some concrete idea of
Kuhlmann’s method of scoring and of the general layout of the scale. Beginning
with the top line, the first numeral on the left is the serial number of the test.
Next appears the age level in years and months at which the test scores first.
Next appears the equivalent growth curve level in mental units. Below the brief
description are the scoring indications. Sometimes the score is the number of
right responses (R). Sometimes it is the number of right responses divided by
the time taken (R/T). The rating in mental units (M.U.) for various score
values appears immediately above the scores.
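
The two quantities just described can be written out as a minimal sketch. The R/T score follows the footnote to Figure 9 (right responses divided by time taken), and the Percent of Average is the subject's M.U. score expressed as a percentage of the average M.U. for his age. The function names and the numbers in the example are hypothetical; no values from the Heinis curve are reproduced here.

```python
def speed_accuracy_score(right_responses, seconds_taken):
    """R/T score: number of right responses divided by the time taken."""
    if seconds_taken <= 0:
        raise ValueError("seconds_taken must be positive")
    return right_responses / seconds_taken


def percent_of_average(mu_score, average_mu_for_age):
    """Percent of Average (P.A.).

    mu_score: the subject's obtained score in mental units (M.U.).
    average_mu_for_age: the average M.U. expected at his chronological
    age, which in Kuhlmann's scheme is read from the Heinis curve.
    """
    if average_mu_for_age <= 0:
        raise ValueError("average_mu_for_age must be positive")
    return 100.0 * mu_score / average_mu_for_age


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the arithmetic.
    print(speed_accuracy_score(6, 120))
    print(percent_of_average(297, 330))
```

With these invented figures a subject scoring 297 M.U. where 330 is expected would have a P.A. of 90.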


2. Critique and Appraisal 


A. Correspondence with Binet. The essential point of resem- 
blance between the Tests of Mental Development and the original 
Binet scale and the two Stanford Revisions is that all of them are 
composite unanalytic instruments containing a multiplicity of
items. Indeed one might say that the basic operative conception 
of general intelligence embodied in the instrument is even more 
loosely defined than with Terman and Binet. Kuhlmann is evidently
committed to the idea of a diverse over-all survey of the
mentality of a subject. In fact he expressly repudiates the notion 


social standards, and the like (v. Kuhlmann, 1939-40). It is, of
course, this point of view which explains, and so far as it is valid 
justifies, the composite, inclusive, unanalytic character of the Tests 


B. Divergence from Binet. While Kuhlmann, in the Tests of
Mental Development, retains the idea of age levels and of a
development of mentality related to chronological age which is
characteristic of Binet, he treats it differently in terms of the
organization of the instrument and in the method of interpreting test
performance. The mental age values of the subtests are given, but
they are not set up expressly as an age scale. Also the use of the
Heinis growth curve is a very distinctive feature. Binet and the
workers responsible for the two Stanford Revisions refrained
from any particular claims as to the course of mental development.
Kuhlmann’s scheme of scoring and interpretation depends
upon a highly specific claim. The idea underlying the M.U. scores,
and of the P.A. or Percent of Average, is that normal mental 
growth is ascertainable and in fact that we know just what it is. 
Without opening up the whole question as to the nature of mental 
growth, which must be considered later on, it is clear that con- 
siderable doubts suggest themselves here. 

C. Reliability. Kuhlmann’s distinctive views on the subject of
the reliability of tests must be noted here, for they affect both the
organization and the interpretation of the Tests of Mental
Development. He has from time to time argued that high reliability
is by no means so desirable as is ordinarily supposed. When a
measuring instrument is made very reliable, it is not affected by
variable errors to any extreme degree. But for Kuhlmann many
of the variable influences which can affect test performance are
not sources of “error” at all, but perfectly legitimate factors that
ought to be recognized, and accordingly should be reflected in the
test score. Among these might be the mood, the attitude, the
physical condition of the subject, his fatigue, or his boredom. Granting
that these influences do affect mentality, one can understand why 
Kuhlmann feels that an instrument designed to override and 
ignore them as far as possible, points directly towards falsifica- 
tion. This is why he declines to enter into the question of the 
reliability of the Tests of Mental Development. 

D. Stability of scores. Kuhlmann makes the claim that his char- 
acteristic measure, the Percent of Average, which is his name for what 
Heinis called the Personal Constant, is decidedly more stable than 
the Stanford-Binet I.Q. That is, it retains its meaning in terms of 
distance from the mean with less fluctuation at different age levels. 
The probable errors for the Percent of Average at different age 
levels are shown in Table 11. At first inspection, at least, it would 
seem that these variations are much less than those reported for 
the standard deviations of the Stanford-Binet Intelligence Quo- 
tients in Table 10. 

E. Practical values and limitations. The discussion of reliability 
leads immediately to a consideration of the practical values and 
limitations of the instrument, for Kuhlmann’s position on the 
problem of reliability depends on his practical orientation. What 
it implies in action is the use of the instrument for diagnosis as 
well as measurement in the sense of arriving at a global total score 
indicative of the subject's over-all mental level. This, too, is why 
he expressly prefers a composite, loosely integrated instrument 
rather than one built logically on a well-defined concept. 


TABLE 11 


PROBABLE ERRORS * OF PERCENTS OF AVERAGE AT VARIOUS AGE LEVELS 
(Adapted from McNemar, 1942, Table 51, p. 162) 


Probable Errors.. 9 8 7 9 7 9 8 8 


Bearing this in mind, the Tests of Mental Development have 
outstanding advantages. The subtests are ingenious and interest- 
ing. They make demands on mental power, and so seem well cal- 
culated to reveal it. The placement of many tests at several 
different age levels permits one to measure a developing mentality 
by degrees. Also it makes for flexibility in the use of the instru- 
ment. The inclusion of some tests which can be given to small 
groups is noteworthy in an instrument primarily for individual 
administration. The emphasis on the time factor in connection 
with accuracy and difficulty is another commendable feature. So 
also is the inclusion of considerable nonverbal material among the 
subtests. 

The most serious practical disadvantages are the complicated 
scoring and the administrative procedure in giving the tests, which 
at least seems quite difficult, and which untrained workers might 
well find baffling. There is no doubt that these factors militate 
against the wide use of the tests. On the theoretical side, by far 
the most serious question turns upon the use of the growth curve 
developed by Heinis as the basis of the norms and of the inter- 
pretations of test performance (v. Pignatelli). 


* The probable error is .6745 times the standard deviation. This must be borne 
in mind in comparing the above with standard deviations of intelligence quotients. 
FAR-REACHING REVISION OF BINET’S PRACTICES 


1. Wechsler-Bellevue Intelligence Scale * 

This is the outstanding example of an instrument representing a 
far-reaching revision of the practices and ideas of Binet, yet not 
unrelated to them. 


1. Characteristics 


A. General characteristics. This scale is an instrument for indi- 
vidual use. It is applicable to ages from 10 to 60 and upwards. It 
is particularly designed for adults and is regarded as the most gen- 
erally satisfactory instrument for the measurement of adult in- 
telligence. 

A synoptic outline is presented in Figure 10. As will be seen, it 
consists of 10 units or subtests and one alternative. Each test unit 
is applicable to a wide age range, with differential scoring. Thus it 
illustrates one of the distinctive advantages claimed for point 
scales by Yerkes and Foster. 

The test units or subtests are related and combined to form four 
separate but interrelated scales of intelligence. First, there is the 
Main Individual Examination for ages 10 to 60, consisting of all 
the test units. This can be reduced to 7 instead of 10 tests, if so 
doing seems desirable in the light of the adjustment and type of 
the subject. Second, there is the Adolescent Scale for ages 10 to 
16, using the same test units, but with a different standardization. 
Third, there is the Performance Scale consisting of subtests 6 to 
10 inclusive. Fourth, there is the Verbal Scale, consisting of tests 
1 to 5 inclusive, with the vocabulary test as an alternative. 

Wechsler (1944) has presented an evaluation of the subtests in 
the light of their relationship to performance on the whole scale, 
and of their general psychological character. (a) As to the general 
information test (no. 1), he points out that any information test 
depends for its value on the use of information items that are com- 
mon knowledge. He finds that his test successfully samples general 
information, but that it is not so successful for those with special 
opportunities and training. Its correlation with performance on 
the whole scale is .66 for ages 20 to 34, and .68 for ages 35 to 49. 
(b) The arithmetical reasoning test (no. 3), like all tests of this 
type, was easy to make and to standardize. Its correlation with 


* Reference: Wechsler, 1944. 
1. GENERAL INFORMATION 
Twenty-five information questions to be answered Right or 
Wrong, in order of difficulty. 

2. GENERAL COMPREHENSION 
Ten questions (what to do? why thus and so?). 

3. ARITHMETICAL REASONING 
Ten timed verbal arithmetic problems. 

4. DIGIT REPETITION 
Fourteen sets of digits, ranging from three to nine per set, to 
be repeated forward. Fourteen sets, from three to eight, to be 
repeated backward. 

5. SIMILARITIES 
Twelve pairs of words, task being to indicate in what way 
they are similar. 

6. PICTURE COMPLETION 
Fifteen cards showing pictures each with a part missing, task 
being to indicate what is missing. 

7. PICTURE ARRANGEMENT 
Six sets of cards, each set “telling a story” in sequence, e.g. 
the catching of a fish, the task being to arrange each set in 
proper order. 

8. OBJECT ASSEMBLY 
Three sets of cutouts, which assemble into three objects 
(manikin, profile, hand), task being to put them together. 

9. BLOCK DESIGN 
Sixteen cubes in colors, nine designs on cards, task being to 
reproduce the given designs by means of the blocks. 

10. DIGIT SYMBOL 
Nine divided boxes, giving digits each with corresponding 
symbol, task being to write correct symbols under sixty-seven 
numbers. 

11. VOCABULARY 
Subject to tell meaning of forty-two words. 


Fig. 10. WECHSLER-BELLEVUE INTELLIGENCE SCALE. 
SYNOPTIC OUTLINE 
the whole scale is .63 for ages 20 to 34, and .67 for ages 35 to 49. 
(c) The digit memory test (no. 4) is again easy to handle. But it is 
a poor test. It is retained because it is effective for lower age 
levels, and because it often diagnoses mental defect. Its correlation 
with the whole scale is reported at .51. (d) The similarities test 
(no. 5) is one of the best included. It correlates with the whole 
scale .73. (e) The picture arrangement test (no. 7) depends for 
its effectiveness on the scenes depicted. Thus a picture of a bird 
building its nest might have different discriminatory and other 
values from one of a policeman chasing an automobile. The 
attempt was made to use familiar scenes for the items of this test. 
However, it does not discriminate well in terms of age levels. Its 
correlation with the whole scale is .51. (f) The picture completion 
test (no. 6) meets all criteria quite well. It correlates with the 
whole scale .61. (g) The digit-symbol test (no. 10) correlates with 
the whole scale .673 for ages 20 to 34 and .697 for ages 35 to 49. 
(h) The block design test (no. 9) is a good test, and is effective in 
picking out those low in intelligence. It correlates with the whole 
scale .73. (i) In constructing the object assembly test (no. 8), a 
difficulty was to secure familiar configurations to be assembled. It 
has been retained because it makes significant additions to the 
total score, and because it gives opportunity for the examiner to 
make a qualitative analysis of the subject’s mental processes. Its 
correlations with the whole scale are .41 for ages 20 to 34, and .51 
for ages 35 to 49. These correlations are low because there were 
wide deviations in the scores of small groups of subjects in the 
standardization group. (j) The vocabulary test proves to be an 
excellent measure of school achievement and of general intelli- 
gence. It correlates .85 with the scale as a whole. 

A second form was published in 1946, and a special army adap- 
tation, known as Form B, was prepared during the war. 

B. Scaling and standardization. The norms are based upon 
standardization groups of 670 children and 1,081 adults, selected 
from about twice as many persons in proportion to the occupa- 
tional distribution of the white population of the United States. 
There were from 50 to 175 subjects for each age group from 7 
to 70. Most of the adult age levels are at five-year intervals. 

C. Scoring and administration. The scoring scheme of the test 
introduces another unique feature. Since it extends upwards to the 
chronological age group from 60 to 70, it does not yield intelligible 
mental ages. Wechsler in fact is very critical of the concept of 
mental age (1944). He points out that a mental age has about it 
nothing sacrosanct or mysterious, and that it is simply a score like 
any other score. Moreover, as he says, it is just as anomalous to 
talk about a person 60 years old as having a mental age of 16 as 
it would be to say that a child of 10 had a mental age of 60. For, 
to repeat, a mental age is simply a score. Thus a mental age of 122 
months on the Stanford-Binet scale presumably means 61 items 
credited. 

With this in mind Wechsler disregards the mental age concept 
entirely, and converts the raw scores yielded by the scale into his 
own units of measurement directly. These units he calls intelli- 
gence quotients, but since there is no mental age determination, 
they cannot be obtained by dividing it by chronological age in the 
usual way. In fact, his intelligence quotients are not in reality 
quotients at all. The I.Q. is figured always with reference to the 
distribution of scores of the age group. The I.Q. value of 90 is set 
for the score which is 1 probable error below the mean score of 
the age group. The I.Q. value of 110 is set for the score which is 1 
probable error above the mean score of the age group. 

The probable error (or P.E.), it should perhaps be explained, is 
a measure of dispersion which is .6745 times the standard devi- 
ation. When a distribution is normal, that part of it which lies 
between 1 P.E. above and 1 P.E. below the mean will include 
50% of the cases. This is the reason for Wechsler’s choice of the 
P.E. in determining the values of his derived scores. By common 
acceptance the I.Q. range from 90 to 110 includes those of “nor- 
mal” mentality or “average” mentality, which is thought to be 
about 50% of the population. Thus by defining I.Q.’s of 90 and 
110 as 1 P.E. below the mean and 1 P.E. above the mean, Wechsler 
sets them at levels which may be expected to include 50% of the 
population. The general logic of this procedure is analogous to 
that in working out standard scores as explained above. Standard 
scores, of course, are based upon the standard deviation as a unit 
in terms of which to measure the distance of any obtained raw 
score from the mean. Wechsler, for the reasons just given, prefers 
to use the probable error. But there is a fixed relationship to ordi- 
nary standard scoring, because the P.E. is always .6745 times the 
S.D. Once the I.Q. values of 90 and 110 have been determined, it 
is an easy matter to work out all other values and their equivalents 
in raw scores. Wechsler presents tables in his book from which 
the I.Q. equivalent of any raw score can be read off. 
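
The arithmetic of this scoring scheme can be illustrated with a short sketch. It is not Wechsler's own procedure or his published tables; it merely assumes that the conversion is linear, so that the mean score of the age group falls at 100 and each probable error counts for 10 points. The function names and the raw-score mean and standard deviation used in the example are hypothetical.

```python
from statistics import NormalDist

# The probable error is .6745 times the standard deviation (stated in the text).
PE_PER_SD = 0.6745


def probable_error(sd: float) -> float:
    """Probable error of a distribution whose standard deviation is sd."""
    return PE_PER_SD * sd


def deviation_iq(raw_score: float, group_mean: float, group_sd: float) -> float:
    """Assumed linear conversion: mean -> 100, one P.E. above or below -> 110 or 90."""
    pe = probable_error(group_sd)
    return 100.0 + 10.0 * (raw_score - group_mean) / pe


def share_within_one_pe() -> float:
    """Proportion of a normal distribution lying within 1 P.E. of the mean."""
    unit = NormalDist()  # mean 0, standard deviation 1
    return unit.cdf(PE_PER_SD) - unit.cdf(-PE_PER_SD)


if __name__ == "__main__":
    mean, sd = 50.0, 10.0  # hypothetical raw-score statistics for one age group
    pe = probable_error(sd)
    print(round(deviation_iq(mean - pe, mean, sd), 2))  # score 1 P.E. below the mean
    print(round(deviation_iq(mean + pe, mean, sd), 2))  # score 1 P.E. above the mean
    print(round(share_within_one_pe(), 3))              # share of cases between them
    print(round(10.0 / PE_PER_SD, 2))                   # implied S.D. of the I.Q.'s
```

On these assumptions the standard deviation of the derived I.Q.'s works out to 10 divided by .6745, or a little under 15 points, whatever the age group.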
2. Critique and Appraisal 


A. The scoring units. Wechsler’s purpose in his distinctive re- 
definition of the intelligence quotient is abundantly clear. It is the 
wish to have a numerical score that will always mean the same 
thing at all age levels. An I.Q. of 90 always means a score which 
is 1 P.E. below the mean, and this significance is constant. If a 
person makes an I.Q. score of 90 when he is 10, and the same score 
when he is 20, and the same again when he is 40, one need not 
ask whether or not some change in his relative standing has taken 
place which is obscured by the seemingly identical rating. This, as 
we have seen, is not quite the case with Stanford-Binet I.Q.’s. An 
I.Q. of 150 on Form L of the Revised Stanford-Binet scale does 
not indicate the same deviation from the mean at the age of 6 
as it does at the age of 12. The reader should carefully compare 
the data in Table 10 with the material in Table 12 and in Figure 6. 
He will note the much greater stability of the standard deviations 
of the Wechsler-Bellevue I.Q.’s, age for age, compared to that of 
the Stanford-Binet I.Q.’s. It can hardly be denied that this indi- 
cates a considerable superiority, at least in this respect, for 
Wechsler’s scheme of scoring. A given I.Q. obtained by the use of 
his scale, and derived from the obtained raw score, always has 
the same significance at any age level, in the strict statistical sense 
of representing always the same deviation from the mean. Put in 
less technical language this means that a given I.Q. will always 
be equally “hard” or equally “easy” to get, or that it will rate an 
individual in the same position relative to other persons in his age 
category. 

The use of the term I.Q. for this score is, however, open to con- 
siderable objection. It was adopted by Wechsler apparently for 
prudential reasons, and because the word is widely familiar. But 
it means something decidedly different from the accepted sense of 
the expression, both in popular use and in test construction. And 
it introduces an element of confusion into the terminology of 
psychometrics. 

B. Reliability and validity. The reliability coefficients reported 
for the scale are in the order of .90. 

The conception which the test is intended 
to embody is made quite explicit. Intelligence, as Wechsler under- 
stands it, is the aggregate or global capacity of a person to act 
purposefully, to think rationally, and to deal effectively with the 
environment. Wechsler recognizes the importance of nonintel- 
lectual factors in this inclusive capacity. But he argues that if a 
test reveals enough of what general intelligence is to enable one 
to predict global capacity with reasonable confidence, it is satis- 
factory. 

TABLE 12 


MEANS AND STANDARD DEVIATIONS OF I.Q.’S FOR AGE GROUPS ON 
WECHSLER-BELLEVUE INTELLIGENCE SCALE 


(Quoted from Wechsler, 1944, Table 19, p. 122) 


           No. of     Mean     Standard 
Age        Cases      I.Q.     Deviation 
10           60      101.25     13.20 
11           60      100.84     14.10 
12           60      100.08     13.80 
13           70      100.57     14.70 
14           70       99.93     14.75 
15          100      100.00     14.57 
16          100      100.30     15.15 
17-19       100       98.75     14.50 
20-24       160      100.16     13.70 
25-29       195      100.89     14.60 
30-34       140       99.67     15.60 
35-39       135       99.75     15.50 
40-44        91      100.30     14.80 
45-49        70      100.07     14.01 
50-54        55      100.50     13.97 
55-59        50       99.1      16.85 
60-69       105       99.84     15.26 


The internal validity and self-consistency of the instrument has 
been ascertained, and the degree of it is reported in the data 
already presented, particularly in regard to the selection of the 
subtests and their correlation with the scale as a whole. 

As to external validation, Wechsler regards correlation with 
similar tests as a minimum requirement in all cases. The obtained 
correlations in the present instance are shown in Table 13. 
TABLE 13 


CORRELATIONS OF OTHER TESTS WITH WECHSLER-BELLEVUE 
INTELLIGENCE SCALE 


(Adapted from Wechsler, 1944, Table 28, p. 134) 


                                                    Number    Correlations 
                                                      of          with 
Name of Test                                        Cases      Wechsler- 
                                                                Bellevue 
Stanford Revision of Binet Scale (1916)               75          .82 
Stanford Revision of Binet Scale (1916)          [illegible]  [illegible] 
Revised Stanford, Form L (1937)                       55          .91 
Revised Stanford, Form L (1937)                      112          .62 
Stanford Revision of Binet Scale (1916)              125          .57 
Revised Stanford (1937)                               60          .93 
Army Group Examination Alpha                          92          .74 
American Council on Education Psychological 
  Examination for College Freshmen                   112          .53 
Morgan Test of Mental Ability                        125          .62 
Henmon-Nelson Test of Mental Ability                  50          .51 
I.E.R. Intelligence Scale, CAVD                      108          .69 
I.E.R. Intelligence Scale, CAVD                       60          .39 
Otis Self-Administering Tests of Mental Ability      108      [illegible] 
Otis Self-Administering Tests of Mental Ability       60          .53 


Balinsky and Wechsler (q.v.) have undertaken to compare the 
validity of the Wechsler-Bellevue scale with that of the 
Stanford-Binet. They applied both tests to two groups of retarded 
or disordered patients, and adopted as their validation criterion 
the record of the later commitments of these patients to a state 
institution for mental defectives, which was decided on the basis 
of a study of facts from case histories and from psychiatric 
interviews and observations. The data obtained are summarized 
in Table 14. It will be seen that the correlations of the 
Wechsler-Bellevue scale are consistently higher, and in two in- 
stances very much higher than those yielded by the Revised 
Stanford-Binet scale. Altus (q.v.) has published subtest correla- 
tions from [illegible] for Army trainees against criterion of suc- 
cess in training, using Form B. Considering the shortness of the 
Form B subtests and the narrow range of ability of the group, 
these indicate a significant degree of relationship. 


TABLE 14 


CORRELATIONS OF TWO SCALES WITH COMMITMENT TO INSTITUTION FOR 
FEEBLEMINDED 


(After Balinsky and Wechsler) 


                        CORRELATIONS OF I.Q.’S WITH INSTITUTIONAL 
NUMBER OF CASES                        COMMITMENT 
IN EACH GROUP 
                        Wechsler-Bellevue      Revised Stanford 
      49                      .753                   .664 
      36                      .720                   .611 
      81                      .791                   .325 
      63                      .785                   .274 


C. General points. The Wechsler-Bellevue scale is intrinsically 
superior for adults in comparison with the Revised Stanford-Binet 
scale. It does not involve the difficulty encountered by the latter 
in connection with ages above 16. Its units of measurement are 
more realistic and exact, in spite of the rather unfortunate use of 
the term intelligence quotient. The subtests are better suited for 
older persons than the upper-level subtests of the Stanford-Binet. 
Besides this, it is a more flexible instrument, though it might be 
made still more so if norms were available for the separate sub- 
tests so that they could be used independently at the discretion 
of the examiner. 


2. Detroit Tests of Learning Aptitude * 

Another instrument comparable to the Wechsler-Bellevue scale 
in general make-up, working principles, and relationship to the 
ideas of Binet is the Detroit Tests of Learning Aptitude. It con- 
sists of 19 subtests, 13 of them being used for ages 3 to 6, 15 for 
ages 9 to 12, and 13 for ages 14 and over. Representative samples 
are shown in Figure 11. A separate M.A. is obtainable from each 
subtest, which is an innovation in line with advanced notions of 
test construction. However, in comparison to the Stanford-Binet, 


* Reference: Baker and Leland. 
the Kuhlmann tests, and the Wechsler-Bellevue, its standardiza- 
tion is inadequate, and the analytic interpretations presented in 
the manual by the authors are unconvincing. It is supposed to 
reveal reasoning and comprehension, practical judgment, verbal 
ability, number ability, auditory attention ability, visual attention 
ability, and motor ability. The authors also claim that these abil- 
ities are embodied in various school subjects. But the whole inter- 
pretation lacks statistical foundation, and while basic mental traits 
may perhaps be revealed by factor analysis, they certainly cannot 
be by direct inspection, for then they appear uncomfortably like 
faculties. Moreover, the title, mentioning “learning aptitude,” is 
attractive. But one must remember that the specific relationship 
between mental test performance and learning performance is 
slender and doubtful. All in all, criticism on the grounds of being 
a “hodgepodge” seems far juster here than when directed against 
the Stanford-Binet scale. 


COMPLETE DEPARTURE FROM BINET’S PRACTICE 


1. I.E.R. Intelligence Scale CAVD (Thorndike and Others) * 

This scale marks a very definite departure from the ideas and 
practices of Binet. It is for individual administration at the lower 
age levels, and can be used as a group test at upper levels. It is 
applicable from the age of 3 years to upper adult levels. 

It embodies very clear and explicit conceptions. Thorndike, as 
already pointed out, regards intelligence as manifesting the four 
attributes of altitude as determined by the difficulty of the tasks 
that can be done, range as meaning the spread of different tasks 
at each level of difficulty, area as meaning the total number of 
possible tasks at all levels of difficulty, and speed. Altitude is the 
most essential characteristic of intelligence, and range is said to 
correlate with it perfectly, at least in theory, the relationship being 
lowered in individual cases by the effects of training, experience, 
and opportunity. Thus the scale is set up primarily to measure 
altitude of intellect. 

It consists of four types of items: sentence completion, arith- 
metical problems, vocabulary, and directions. In deciding upon 
these types of items Thorndike was influenced by the following 
considerations. 


* References: Thorndike and Others, 1927; Thorndike, Woodyard, and Lorge. 
1. Pictorial Absurdities 
Pictures with something foolish in each, to be identified. 
2. Verbal Absurdities 
A series of absurd statements, absurdity to be identified. 
5. Motor Speed and Precision 
Putting crosses in numerous circles. 
6. Auditory Attention Span for Unrelated Words 
Two sets of unrelated words in items of increasing length, to 
be given orally and repeated. 
7. Oral Commissions 
Giving the subject instructions to do various things. 
8. Social Adjustment A 
Questions about social situations, e.g., what to do if one’s 
radio disturbs somebody. 
10. Orientation 
42 miscellaneous questions. 
11. Free Association 
A list of stimulus words given orally to which the subject 
makes a free-association response. 
12. Memory for Designs 
Copying geometrical figures from memory. 
19. Likenesses and Differences 
Pairs of things named, some alike, some different, subject to 
tell which and in what way. 


Fig. 11. DETROIT TESTS OF LEARNING APTITUDE. 
PARTIAL SYNOPSIS 


First were considerations having to do with psychological 
theory. (a) A response to parts or aspects of a situation is more 
“intellectual” than one to gross totals. (b) A response to parts or 
aspects of a situation not presented to the senses is more intellec- 
tual than one to those that are presented to the senses. (c) A re- 
sponse to the relationships between objects is more intellectual 
than a response to the objects themselves. (d) A response to sub- 
jective relationships, such as likeness and difference, is more in- 
tellectual than a response to objective relationships, such as those 
of space and time. (e) The organization of many mental connec- 
tions, i.e., “thinking things together,” is more intellectual than 
using one habit at a time. (f) Response to novel situations is more 
intellectual than response to familiar situations. 

Second are considerations related to the theory of measurement. 
(a) Tasks representing a single ability should be capable of very 
fine gradations from easy to hard. (b) Such tasks should be 
capable of very wide extension by alternatives at any level of 
difficulty. (c) So far as possible any one single ability should 
represent something varying only in amount. 

Third are considerations related to common sense. (a) The tasks 
should have high correlations with reasonable criteria of intellect. 
(b) They should be convenient for use. (c) They should be tasks 
for which subjects for experimentation are available. 

Thus Thorndike arrives at his notion of “intellect CAVD.” His 
explanation of the four designations is as follows. “C. To supply 
words so as to make a statement true and sensible. A. To solve 
arithmetical problems. V. To understand simple words. D. To un- 
derstand connected discourse as in oral directions or paragraph 
reading” (Thorndike and Others, 1927, p. 65). 

The four extensive subtests of the scale are subdivided into 17 
levels, designated from A to Q. On level A the expectation is that 
a child with a mental age of 3 will get 50% of the items right. On 
level Q a score of 50% is attained by less than 10% of college 
students. Levels A to E are for preschool, F to H for elementary 
school, G to K for junior high school, I to M for senior high school, 
and N to Q for college and graduate levels. There are in all 40 
items at each level, making 10 for each subtest. 

A striking feature of the scale is that it purports to measure 
increments of intelligence from a true zero point and in equal 
units. The steps of equal difficulty are obtained by determining the 
difficulty level of test performances by the percentages of the 
standardization group passing at each point, and by converting 
these percentages into scores based on the standard deviation of 
the group. 
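
The conversion of percentages passing into standard deviation units may be illustrated by the following sketch. It shows only the general normal-curve conversion, on the assumption that the ability of the standardization group is normally distributed; Thorndike's full procedure for locating a true zero and for equating units from one group to another is more elaborate and is not reproduced here. The function name and the percentages used are hypothetical.

```python
from statistics import NormalDist


def difficulty_in_sd_units(proportion_passing: float) -> float:
    """Difficulty of a task as a distance from the group mean in S.D. units.

    Assumes normally distributed ability in the group. A task passed by
    exactly half the group has difficulty 0; harder tasks are positive,
    easier tasks negative.
    """
    if not 0.0 < proportion_passing < 1.0:
        raise ValueError("proportion passing must lie strictly between 0 and 1")
    # The task sits at the point on the ability scale that the stated
    # proportion of the group exceeds.
    return NormalDist().inv_cdf(1.0 - proportion_passing)


if __name__ == "__main__":
    # Hypothetical percentages passing for four tasks of rising difficulty.
    for p in (0.84, 0.50, 0.16, 0.02):
        print(p, round(difficulty_in_sd_units(p), 2))
```

Equal steps on the resulting scale are equal distances along the assumed normal distribution of the group, not equal differences in the percentage passing.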

The scale yields three kinds of scores, namely altitude scores 
which mean the percent of items passed, range scores which mean 
the percent of items passed at each level, and area scores which 
mean the sum of all successes. In its original form the scale is 
difficult to score and to administer. A more practicable form has 
been developed by Thorndike and Woodyard (q.v.). The obtained 
reliabilities and intercorrelations of subtests are high. 

The distinctive feature of the scale as a psychometric instrument 
lies in Thorndike’s claim that it represents a true unitary sampling 
of intellect very sharply defined. In this it is in sharp contrast with 
Binet’s own scale and the instruments more or less patterned on 
it, which are loose composites and defended as such. Also, it differs 
sharply from the Wechsler-Bellevue scale or the Detroit Tests of 
Learning Aptitude, which are also composite in character. 


SUGGESTED ADDITIONAL READINGS 


The reader who wishes to make a thoroughgoing study of any of 
the scales here discussed should turn to the references on each that 
are given in the text. However, the following suggestions are made: 

Rudolph Pintner, Intelligence testing: methods and results (Rev. 
ed.; New York: Henry Holt and Company, 1937), Chapter 2, “The 
work of Binet.” 

Quin McNemar, The revision of the Stanford-Binet Scale. An 
analysis of the standardization data (Boston: Houghton Mifflin 
Company, 1942), Chapter 1, “The revision procedures.” 

Lewis M. Terman and Others, The Stanford Revision and Exten- 
sion of the Binet-Simon Scale for Measuring Intelligence (Baltimore: 
Warwick and York, Inc., 1917). This is a detailed technical account 
of the first Stanford Revision. 

Lewis M. Terman, The measurement of intelligence (Boston: 
Houghton Mifflin Company, 1916). This is a less technical account 
of the first Stanford Revision. 

Lewis M. Terman and Maud A. Merrill, Measuring intelligence 
(Boston: Houghton Mifflin Company, 1937). This is an account, not 
highly technical, of the second Stanford Revision. 

F. Kuhlmann, Tests of mental development (Minneapolis: Edu- 
cational Test Bureau, 1939). Contains a full account of the tests, 
their principles, scoring, etc. 

David Wechsler, The measurement of adult intelligence (Balti- 
more: The Williams and Wilkins Company, 1944), Chapter 10, “Limi- 
tations and special merits.” 


QUESTIONS FOR DISCUSSION 


1. Assemble from the chapter all the various methods used for 
constructing intelligence scales with a view to making them valid, and 
of demonstrating their validity. Consider how much they seem to 
prove, and what they prove. 

2. Which of the two meanings of mental age seems to you more 
reasonable? Which is more commonly used? Read Otis 1916 and 
Thurstone 1926 on the matter. 

3. From the data in Table 10 and Figure 6 show how any given 
I.Q. would shift at different age levels if the subject remained the 
same standard deviation distance from the mean at all ages. Would 
similar shifts occur with the Percent of Average (Table 11)? With 
the Wechsler I.Q.’s (Table 12)? 

4. Compare the Stanford-Binet scale, the Tests of Mental Develop- 
ment, and the Wechsler-Bellevue scale to see how far the arguments 
in favor of point scales summarized in Figure 8 are borne out in 
connection with them. 

5. What are the arguments for and against a “composite” or very 
diversified scale for the measurement of intelligence? Discuss this in 
connection with the specific scales here considered. 

6. In what respects do Terman and Kuhlmann seem to agree as 
to the course of mental growth? What difference do you find between 
them? 

7. List specifically and discuss the progressive steps away from the 
practices and ideas of Binet represented in the scales here considered. 

8. Why might a composite test be expected to have more clinical 
value than a highly unified one? Would this make the former a better 
test of intelligence? 

9. Why does an age of arrest occur in the scoring of the Stanford- 
Binet scale but not in the scoring of the Wechsler-Bellevue scale or 
the I.E.R. Intelligence Scale CAVD? 

10. Vocabulary subtests are used in several of the scales here dis- 
cussed. Which ones? Does this suggest any general attitude on the 
part of psychometric workers to criticisms of such tests? 


CHAPTER V 


TESTS OF INTELLIGENCE (I) 


Group intelligence testing has an origin almost as definite as 
that which led to the type of instruments just discussed. It 
emerged as a major movement out of the work done in the United 
States Army in World War I. It was not at that time a wholly 
new undertaking, but it received a major impulsion from the Army 
Work and its success. 

The evolution of group intelligence testing, however, has been 
much less clear-cut than that of individual intelligence testing. 
Many of the early group tests are still widely used, usually in 
revisions and sometimes with new names, with refinements and 
improvements, but without basic alteration. Moreover the lines of 
development have been much more diverse. The present chapter 
and the following one are devoted to this subject. The topics to be 
covered are: Army testing during World War I, tests used pri- 
marily for wide age ranges, tests for high school and college level, 
performance tests, tests for young children, tests for adults, new 
and emerging types of tests. Under each of these headings an 
attempt will be made to show the development that has taken 
place, and the whole discussion will end with a summary and 
appraisal of significant trends. 


ORIGINS: ARMY TESTING IN WORLD WAR I 


As has been said, the first major development of group intelli- 
gence testing was the work in the United States Army during 
World War I. The original tests that were constructed have be- 
come outmoded, although some of their revisions are still used to 
some extent. But many of the persistent problems of group meas- 
urement defined themselves then, and many of its characteristic 
concepts and methods were established. Thus some understanding 
of the original Army testing program is valuable as leading to a 
comprehension of the development since that time. 


1. The tests and their chief revisions * 


Two major tests were developed in connection with the Army 


* References: Yerkes, 1921; Yoakum and Yerkes (for the tests); Guilford 
(1938) ; Wells (1932) (for Modified Alpha) ; Kellogg and Morton (for revised 
Beta). 
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program. One was a verbal intelligence test, known as Army Group 
Intelligence Examination Alpha. The other was a performance 
test, known as Army Group Examination Beta. 

A. Army Group Intelligence Examination Alpha. This test was 
arrived at by a fairly prolonged experimental process, and is itself 
a revision of an earlier tentative instrument known as “A.” It 
consists of 8 subtests. (1) Following verbal directions, intended 
to reveal the span of auditory attention, with a time allotment of 
4 minutes. (2) 20 arithmetic problems, time 5 minutes. (3) 16 
items involving “common sense” or “practical judgment” of what 
to do in described problematic situations, time 12 minutes. (4) 
40 pairs of words to be marked as meaning the same or different, 
time 1½ minutes. (5) 24 disarranged sentences to be understood 
and marked true or false. (6) 20 incomplete number series to be 
completed, time 3 minutes. (7) 40 verbal analogies, time 3 min- 
utes. (8) 40 multiple-choice items calling for miscellaneous in- 
formation. 

The test is scored on the number of right responses, except with 
subtests 4 and 5, in which a deduction for error is made to compensate 
for chance. The test was originally in five forms of equivalent 
difficulty, the same pattern of subtests appearing in each, but 
with different items. 
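
The deduction for error on the two-choice subtests can be illustrated with the conventional correction for chance, rights minus wrongs divided by one less than the number of choices. The sketch below assumes that textbook formula; the source does not quote the Army's exact scoring rule, and the function name and sample figures are hypothetical.

```python
def corrected_score(rights: int, wrongs: int, n_choices: int) -> float:
    """Conventional correction for chance: R - W / (k - 1).

    Assumption: this is the standard textbook formula, not one quoted
    from the Army manuals. Omitted items are neither rewarded nor
    penalized, so they do not appear in the calculation.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return rights - wrongs / (n_choices - 1)


# Two-choice items (same/different, true/false): score is rights minus wrongs.
print(corrected_score(rights=30, wrongs=6, n_choices=2))  # 24.0
```

With two choices a blind guess is right half the time, so each wrong answer is taken to cancel one lucky right answer.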

B. Revisions of Army Alpha. The First Nebraska Revision 
follows the original make-up and scoring plan. The revision is 
based on an analysis of the items. The items found by correlation 
with the test as a whole and by their power of differentiation to 
be the most diagnostic were retained. Also, numerous items in the 
original test which implied special Army knowledge and experience 
were eliminated. The original norms were retained. Intelligence 
quotient norms were worked out on the basis of the distribution 
of intelligence quotients as given by Terman and Merrill (1937 b) 
for the American White population. No data on validity or relia- 
bility are given in the manual. 

In the Schrammel-Brannan Revision, which is intended for 
grades 4 to 16, the original five forms are reduced to three. The 
8 subtests are retained. Oral directions and instructions to the 
subjects, which played an important part in the original instru- 
ment, are reduced, and the test is made largely self-administer- 
ing. Also, devices are introduced to facilitate scoring. Norms were 
worked out for populations of about 700 for each grade from 4 to 
16, together with about 200 college freshmen and about 300 subjects 
for each of the following three college years. Age norms were 
extended from 8 to 25, but are best from 9 to I. A progressive 
increase of scores with age up to 17 was found, after which the 
relationship of score to age became irregular. 

Another important revision, made by F. L. Wells and published 
in 1939, is known as Modified Alpha Examination (v. Wells, 1932). 
The practical judgment subtest is eliminated, and replaced by a 
numerical subtest. Antiquated items are revised. The subtests are 
arranged so that separate numerical and verbal scores can be 
obtained. Percentile norms are supplied for high school boys, for 
high school girls, for seventh and eighth graders, and for adult 
men applying for executive positions. The reported self-correlation 
for total score is .92 with a standard error of 6.71. Total score 
correlations with the Otis Self-Administering Test of Mental 
Ability are .73 for high school freshmen, .64 for high school sophomores, 
.72 for high school juniors, and .79 for high school seniors. 

Another type of revision has turned in part upon the elimination 
of the separate subtests, and the organization of the whole 
instrument into a single “scrambled” or “spiral omnibus” form. 
The chief argument for this is administrative convenience, for 
whatever instructions are needed can be given all at one time at 
the start of the testing instead of breaking off at the end of each 
subtest. The subject responds to a series of items drawn in irregu- 
lar order from the various subtests. 

C. Army Group Examination Beta. This is a paper-and-pencil 
test, in general of “performance” type, intended for those 
unable to read English. It was arranged so that the examiner 
could instruct the subjects by pantomime and with a minimum 
of words, and by means of blackboard demonstrations on how to 
work. It consisted of 7 subtests. (1) 5 mazes, time 2 minutes, 
emphasizing speed. (2) 16 pictures of piles of cubes, the task being 
to give the number in each pile, time 24 minutes. (3) Nonverbal 
completions, consisting of patterns of the letters X and O to be 
completed in series as begun, time 15 minutes. (4) Association 
of symbols with numbers according to a given code, time 2 minutes. 
(5) A series of pairs of numbers to be compared, the task 
being to mark those in which the numbers were different, time 3 
minutes. (6) Drawing the missing parts of a picture, time 3 minutes. 
(7) 10 paper form-board problems, time 2 minutes.* 

* To understand the general character of the paper form board see Figure 12. 
This is a good concrete illustration of the problem of translating performance 
items into a form manageable in group testing. 
The scoring is on the number of right responses, except with 
the code substitution subtest. 

D. Revision of Army Beta. The most important revision of 
Beta is that made by Kellogg and Morton. Certain not very 
important changes were made in the subtests. (a) The maze test 
was retained from the original Beta. (b) Tests 2 and 3 from the 
original Beta were eliminated (cube analysis and X-O). As sub- 
stitutes were introduced a picture discrimination test in which 
the task was to find the wrong item in a picture, and a modifica- 
tion of the digit-symbol test, according to which the subject writes 
in numbers instead of symbols as he previously did. (c) Test 5 in 
the original Beta (number comparison) was extended to include 
pairs of pictures as well as pairs of numbers, and also pairs of 
symbols. (d) Tests 6 and 7 in the original (picture completion 
and form boards or geometrical constructions) were retained in 
their previous form, except that they were lengthened. 

The test was standardized on Canadian children but is suitable 
for use in the United States. A split-half reliability of .987 and 
a retest reliability of .77 are reported. The authors also give mental 
age equivalents for the various scores. But they do not give mental 
age distributions at the various ages, which makes the calculation 
of intelligence quotients meaningless. On the whole, the revision 
is better than the original test, which merely served for the rapid 
measurement of illiterates. 


2. Points of significance for psychometrics 


In the construction, revision, and use of these tests a number 
of considerations of permanent and far-reaching significance for 
mental testing were involved (v. Yoakum and Yerkes; Yerkes; 
throughout). 

A. The conception of intelligence. A working conception of 
intelligence was set up as a guide and a criterion for validation. 
It consisted of (a) formal school attainment; (b) scores on the 
Stanford Revision of the Binet scale; (c) ratings by officers. 
Ratings on these three variables were obtained for a standardiza- 
tion group of about 900. A preliminary form of Alpha was found 
to correlate .75 with the first, from .80 to .90 with the second, and 
from .50 to .70 with the third. Alpha was found to correlate .72 
with scores on the Trabue Language Completion Scale, an early 
group intelligence test. Total scores on Alpha correlated .80 with 
total scores on Beta. 
EXAMPLE 

Which of the five figures A, B, C, D, E reproduce the two forms 
in the example placed together? 


FIG. 12. SAMPLE ITEM FROM THE REVISED MINNESOTA 
PAPER FORM BOARD 

B. Criteria for selecting subtests. Two criteria were kept in 
mind for the selection of subtests. These have always been re- 
garded as of importance since that time. (a) It was desired that 
the separate subtests should have the highest possible correlations 
with total scores on the test. In the case of Alpha, the lowest 
correlations were found in the case of verbal directions, practi- 
cal judgment, and disarranged sentences, the coefficients running 
around .65. The highest, running around .85, were found for 
arithmetic problems, verbal opposites, verbal analogies, and infor- 
mation. (b) It was considered desirable that the subtests, while 
correlating as high as possible with the test as a whole, should 
correlate as low as possible with one another. This is a com- 
monly accepted statistical norm, for which there is an obvious 
reason. Closely related subtests clearly involve much in common. 
Thus each one will give less new information about the subject 
than in the case of subtests which are not closely related and have 
little in common. Put in statistical terms, if the subtests have high 
intercorrelations, each adds little to the validity of the total final 
score. Thus low intercorrelations are statistically desirable. But 
often they are psychologically impossible, for the reason that the 
subtests are not built to measure distinct and different mental 
factors, and overlap greatly in their factorial content. This was 
found to be so in the present case, for the mean of intercorrela- 
tions of the subtests of Alpha was about .61, and those with the 
closest relationship to total scores had the closest relationship to 
one another. So in the Army test construction this second criterion 
had to be abandoned, which has often happened in such work 
since then. 
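
The statistical argument for low intercorrelations can be made concrete with the standard formula for the correlation between an equally weighted sum of k standardized subtests and an outside criterion. The sketch below is an illustration under simplifying assumptions that are not in the source: every subtest is taken to correlate equally (r_v) with the criterion, and the subtests are taken to share one mean intercorrelation (r_bar). The value .50 for r_v is hypothetical; the figure .61 is the mean intercorrelation reported for Alpha.

```python
import math


def composite_validity(r_v: float, k: int, r_bar: float) -> float:
    """Correlation of an equally weighted sum of k subtests with a criterion.

    Assumes each standardized subtest correlates r_v with the criterion
    and the subtests intercorrelate r_bar on the average:
        r_composite = r_v * sqrt(k / (1 + (k - 1) * r_bar))
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return r_v * math.sqrt(k / (1 + (k - 1) * r_bar))


# Eight subtests, each with a hypothetical validity of .50.
for r_bar in (0.20, 0.61):
    print(r_bar, round(composite_validity(0.50, 8, r_bar), 3))
```

Under these assumptions, eight subtests intercorrelating .20 would give a composite validity of about .91, while the same subtests intercorrelating .61 would give only about .62, which is the sense in which closely related subtests add little.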

C. Specialized items. Army Group Intelligence Examination 
Alpha, in original form, contains many items calling for military 
experience and information. This has been a point of criticism, 
and it has been altered in the later revisions of the test for general 
use. However, as F. N. Freeman (1939) points out, the acquisition 
of ideas and information in a uniform environment is a legitimate 
index of intelligence. This consideration has assumed an increasing 
importance in recent years, for there has been a tendency to 
shape intelligence tests with reference to the background and 
interests of special groups, such as medical students, naval personnel, 
specialists in chemistry, and the like. 

D. Speed and power. An issue of considerable general importance 
which was raised conspicuously in the Army work was that of the 
significance of speed and power in intelligence testing. Alpha, as 
will be seen from the description given above, was a closely timed 


As to the issue in general, in some tests, such as those for typewriting 
or stenography, speed may be the main consideration, 
although even there it is associated with accuracy. In intelligence 
tests, however, power is clearly the important factor. In any situa- 


simply a series of homogeneous low-level tasks cannot be a good 
test of intelligence. And as Thorndike (Thorndike and others, 
1927) remarks, the time taken to do a job can only be significant 
if the job is performed substantially correctly. Even then there is 
a question, for one would not say that if Einstein had taken twice 
as long as he did to work out the basic formula for the theory of 
relativity, this proved that he had much less intelligence than we 
now suppose. 

E. Mental age equivalents. By all means, the most popularly 
exciting issue in connection with the Army work turned on the 
mental age equivalent of scores on Alpha. Also, this issue involves 
much of far-reaching importance for psychometric theory and 
practice. 

Mental age equivalents of various levels of Alpha performance 
were found by comparing the Alpha scores with the Stanford-Binet 
mental ages of the standardization group to whom the latter test 
had been given. Following the letter rating shown in Table 7, 
the mental age equivalents were these: D- (scores 0-14) corresponded 
to an M.A. range of 0 to 9-4. D (scores 15-24) corresponded 
to an M.A. range of 9-5 to 10-9. C- (scores 25-44) 
corresponded to an M.A. range of 11-0 to 12-9. C (scores 45-74) 
corresponded to an M.A. range of 13-0 to 14-9. C+ (scores 75-104) 
corresponded to an M.A. range of 15 to 16-4. B (scores 105-134) 
corresponded to an M.A. range of 16-5 to 17-9. A (scores 
135-212) corresponded to an M.A. range of 18 and upwards. Now, 
there is a formidable implication contained in these data. Since 
the median of the distribution of Alpha scores falls inside the 
step interval of C, it follows that the median mental age revealed 
is about 13 years, and this was found for a presumably representative 
sample of the American White draft, which was also 
presumably a representative sample of the adult population. 
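
The conversion from Alpha score to letter rating and mental age range amounts to a step-interval lookup. A minimal sketch follows, using only the score bounds and M.A. ranges quoted above; the function and variable names are hypothetical, and the M.A. ranges are kept as the strings given in the text rather than parsed.

```python
from bisect import bisect_right

# Lower score bound of each step interval, as quoted in the text.
LOWER_BOUNDS = [0, 15, 25, 45, 75, 105, 135]

# (letter rating, Stanford-Binet M.A. range in years-months), same order.
RATINGS = [
    ("D-", "0 to 9-4"),
    ("D", "9-5 to 10-9"),
    ("C-", "11-0 to 12-9"),
    ("C", "13-0 to 14-9"),
    ("C+", "15 to 16-4"),
    ("B", "16-5 to 17-9"),
    ("A", "18 and upwards"),
]

MAX_SCORE = 212  # top of the A interval


def alpha_rating(score: int) -> tuple[str, str]:
    """Return (letter rating, M.A. range) for an Alpha total score."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("Alpha scores run from 0 to 212")
    # bisect_right finds the first bound above the score; step back one.
    return RATINGS[bisect_right(LOWER_BOUNDS, score) - 1]


print(alpha_rating(60))  # ('C', '13-0 to 14-9')
```

Any score in the C interval, where the median of the draft fell, maps to the 13-0 to 14-9 range on which the "13-year" argument rests.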

When these implications were pointed out with some emphasis, 
and their logical consequences bearing on mass intelligence, mass 
entertainment, and the support of democratic institutions were 
elaborated, they came as a violent and sensational shock. Of more 
importance for students of psychometrics, they were emphatically 
at variance with the results of Terman, who had published data 
showing that the distribution of intelligence quotients at various 
age levels is approximately normal (1916). Two types of explanations 
have been offered. (a) One is to the effect that a test like 
the Stanford-Binet, made for and given to children in school, will 
involve grave falsification when it is given to adults anywhere 
from six to ten years out of school, because of shifts of attitude 
and the deterioration of various skills and information called for 
in the items (F. N. Freeman, 1939). (b) The other is to the effect 
that intelligence is actually dulled and lowered by disuse and 
facilitated and increased by use (Stoddard, 1943). This, of course, 


In a sample six-month period .5% of inductees were reported 
for discharge because of mental inferiority, .6% were recommended 
for labor battalions because of low intelligence, .6% were 


found located chiefly in the A and B categories of intelligence 
scores. Those below C- were expected rarely to succeed in officer 
training courses. Noncommissioned officers were chiefly in the C+ 


ultimate validation of any test is extended practical use. This is 

World War, group testing has been much influenced by its patterns 
and practices but has also developed certain new trends which did 
not appear at that time. We turn now to deal with this further 
development. 


GROUP TESTS FOR CONSIDERABLE AGE RANGES 


A large number of tests, or rather test batteries, have been 
constructed which are applicable to wide age ranges. Some of the 
most important are the following. 


1. Haggerty Intelligence Examination, Delta 1 and Delta 2 


This test, which was published in 1920, is a good early example 
of the type of instrument here under consideration. Delta 1 is for 
grades 1 to 3. Delta 2 is for grades 3 to 9. This instrument is a 
good example of competent direct adaptation of Army practice, 
and it has had a wide use. 

Delta 1 consists of six subtests. (1) Oral directions to be followed. 
(2) Copying designs. (3) Picture completion. (4) Picture 
comparison. (5) Digit-symbol. (6) Word comparisons. The first 
four subtests are nonverbal, and have the general character of 
performance tests. The last two presuppose reading. In each subtest 
preliminary exercises for practice are given, with the intention 
of orienting the pupil to the test and of equalizing preliminary 
experience as much as possible. 

Delta 2 also consists of six subtests. (1) Discrimination between 
true and false statements. (2) Arithmetic problems. (3) Picture 
completion. (4) Discrimination between word pairs as meaning 
the same or opposite. (5) Common-sense or practical judgment in 
described situations. (6) General information. It is a very direct 
adaptation of Army Alpha, the practice exercises it introduces 
being the most significant change. 


2. National Intelligence Tests 

This battery, published in 1920, was constructed by many of the 
same persons who developed and conducted the Army testing program. 
It has had a great popularity, but is now becoming less used. 

There are two scales to be used in testing at each level. These 
come in two separate booklets. A synoptic outline of Scale A is 
shown in Figure 13. Scale B also consists of five subtests as follows: 
(1) Arithmetic problems. (2) General information. (3) Logical 
judgment. (4) Verbal analogies. (5) Similarities and differences 
among numbers, names, and forms. 

The standardization was very thorough, involving the use of 
about 4,000 cases for each age and grade level. The test yields a 
total numerical score, the highest obtainable being 196. A distinc- 
tive feature is the fore-exercise for each subtest, which runs to 
about half its length. It is somewhat of a question whether “dead” 
practice material of this length does not unduly reduce testing 
time. Also, it has been suggested that such extensive practice may 
in part invalidate the test as an indication of adaptiveness. 


3. Henmon-Nelson Tests of Mental Ability * 


This test appeared originally in 1932. It was extended in 1942, 
and in 1946 new norms and interpretive material were published. 
It is a good example of a group test which begins to depart from 


turn were tried out on 500 pupils, and those which correlated best 
with total scores and discriminated best between those known to 
be bright and known to be dull were chosen. This yielded a final 
list of 180 items, which were organized into two forms. This 
process of item selection was applied at each level. Two forms 
were originally constructed for each of grades 3 to 8, 7 to 12, and 
13 to 16. Later a third form was added for grades 3 to 8 and 
7 to 12. Additional forms are being developed. 

The reliability coefficients reported for all levels are high, running 
in the high .80’s and .90’s. Percentile norms for each school 
grade and for the four college years are given. Also, mental age 
equivalents for the obtained scores are given for elementary school 
and high school groups. The norms are worked out for several 
hundred thousand cases, numbers varying for various levels. The 
test correlates from about .50 to over .90 with other good intelligence 
tests. It correlates from .45 to .65 with course marks and 
grade averages in college. 

* Reference: Henmon and Nelson. 


1. ARITHMETICAL REASONING 
For example: 
How many cents are six cents and five cents? 
How many nickels make a dollar? 
How many square inches are there in a card 7 inches long by 
6 inches wide? 


2. SENTENCE COMPLETION 
For example: 
Fish swim ........ the water. 
Boys ........ girls like to ........ ball. 


3. LOGICAL SELECTION 
For example: 
Draw a line under each of the words that tell what the thing 
always has. 
Table—books, cloth, dishes, top, legs. 
Shoe—button, hook, sole, toe, tongue. 


4. SAME-OPPOSITE 
For example: 
If the two words mean the same write S. If they are as different 
as they can be write D. 
Yes ........ No 


5. SYMBOL-DIGIT 
For example: 
Make under each drawing the number found under that drawing 
in the key. 


FIG. 13. NATIONAL INTELLIGENCE TEST, SCALE A. 
SYNOPTIC OUTLINE 


The Henmon-Nelson battery has two distinctive features which 
mark it as an advance on the psychometric practices established 
in the Army work. (a) It is organized on the “spiral omnibus” or 
“scrambled” plan. The items are of wide variety, including in each 
form at each level such types as information, sentence completion, 
logical selection, classification, verbal analogies, number relationships, 
anagrams, disarranged sentences, geometrical analogies, 
proverbs, word meaning, identifying family relationships, and 
arithmetical problems. This range of items is not built into separate 
subtests, but set up in a mixed sequence. The effect is to gain 

istration. (b) Scoring has been rendered 


will be seen, these are departures from Army testing practice, both 
in the interest of efficiency, and neither of them, so far as we know, 
affecting validity or reliability. 

Percentile norms are given for each grade for which the test is 
intended. Grade norms are very frequently given in connection 
with tests which, like the one under discussion, are among the best 
and most carefully constructed that we have. However, they are 
always open to some question, for the obvious reason that a 
school grade is an administrative and not a psychological entity. 
The fact that a child is, for instance, in the sixth grade conveys 
comparatively little information about his mental or even his 
educational status, so that to rate him on a grade norm may be 
quite misleading. However, the test also provides for mental age 
norms and derived intelligence quotients based on raw scores in 
the first instance, and on raw scores and chronological ages in the 
second. This undoubtedly avoids many of the objections to standardization 
on grade levels alone. The criticism of grade norms has 
no great importance where this test is concerned, but it is a 
general point, exemplified in this instance, to which the attention 
of the reader should be called. 


4. Otis Group Intelligence Scale * 


This battery consists of an Advanced Examination and a Primary 
Examination, the former being the better known and the 
more widely used. The Advanced Examination is intended for 
subjects of any age so long as they can read, and specifically for 
those above grade 4. It has been widely used in high school, and 
it is said to be not unsuited for some uses in college, although it 
is too easy for this level. It contains ten subtests. (1) Following 
printed directions, e.g., indicating the fourth letter of the alphabet. 
(2) Verbal opposites. (3) Disarranged sentences to be marked 
true or false. (4) Proverbs to be interpreted, with the proper interpretation 
to be indicated from a choice of statements. (5) Arithmetical 
problems. (6) Geometric figures similar to those shown in 
Figure 1. (7) Verbal analogies. (8) Similarities, in general like 
those shown and cited in Figure 13. (9) Narrative completion. 
(10) Memory, consisting of a story read aloud with questions 
following. 

(The primary examination, which was made later, consists of 
eight nonreading group tests, and is intended for primary grades 
and kindergarten.) 

Votaw (q.v.) has published regression lines and equations for 
predicting American Council on Education Test (q.v.) scores from 
Otis Group scores. The work was done on 70 subjects who were 
given the Otis Test in junior high school and the American Council 
Test six years later. With Otis scores on the X-axis and American 
Council scores on the Y-axis, the equations derived are as 
follows: Y = 1.42 X - 67.2; X = .3747 Y + 74.4. 
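
The use of the two regression equations can be sketched as follows. The sketch reads the second slope as .3747, which is an assumption about a decimal point lost in printing; on that reading the product of the two slopes gives the squared correlation between the tests, and the point where the lines cross gives the two group means. The function names are hypothetical and nothing here is taken from Votaw's paper beyond the two equations.

```python
import math

# Y = 1.42 X - 67.2   (American Council score predicted from Otis score)
SLOPE_Y_ON_X, INTERCEPT_Y_ON_X = 1.42, -67.2
# X = .3747 Y + 74.4  (Otis score predicted from American Council score);
# the leading decimal point in the slope is an assumption.
SLOPE_X_ON_Y, INTERCEPT_X_ON_Y = 0.3747, 74.4


def predict_american_council(otis: float) -> float:
    return SLOPE_Y_ON_X * otis + INTERCEPT_Y_ON_X


def predict_otis(american_council: float) -> float:
    return SLOPE_X_ON_Y * american_council + INTERCEPT_X_ON_Y


def implied_correlation() -> float:
    """The two regression slopes multiply to r squared."""
    return math.sqrt(SLOPE_Y_ON_X * SLOPE_X_ON_Y)


def implied_means() -> tuple[float, float]:
    """Both regression lines pass through (mean X, mean Y)."""
    mean_x = (SLOPE_X_ON_Y * INTERCEPT_Y_ON_X + INTERCEPT_X_ON_Y) / (
        1 - SLOPE_X_ON_Y * SLOPE_Y_ON_X
    )
    return mean_x, predict_american_council(mean_x)


print(round(implied_correlation(), 2))          # 0.73
print(tuple(round(m, 1) for m in implied_means()))
```

On this reading the equations imply a correlation of about .73 between the two tests over the six-year interval.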


5. Otis Self-Administering Tests of Mental Ability 


This well-known and extremely popular battery, although it 
was developed quite early, represents a departure from original 
Army practice similar to that of Henmon and Nelson. The Self-Administering 
Tests are for two levels. The Intermediate Level is 
for grades 4 to 9. The Advanced Examination is for high school 
and college. Both examinations come in two forms. 

* Publication dates of all tests are cited in the bibliography of tests at the 
close of the book. 

Each form contains 75 items on each level. Their content is 
conventional. The “self-administering” feature refers to and is 
based upon the “scrambled” or “spiral omnibus” arrangement, an 
instance of which is shown in Figure 14. The test has high reliabilities, 
coefficients of .92 being reported for the Advanced Examination 
and .95 for the Intermediate Examination. It yields a 
point score that can be transformed into mental ages and so-called 
intelligence quotients. The method of establishing these 
derived values, however, is complex and rather arbitrary. In 
particular the intelligence quotients are not true quotients at all. 
They are based on the deviation or difference of the subject's 
score from the mean score for the age group. If his score is above 
the mean, this deviation is added to 100. If it is below the mean, 
the deviation is subtracted from 100. This has a certain similarity 
to Wechsler’s method of calculating intelligence quotients, by 
defining 1 P.E. below the mean as I.Q. 90. It is, however, less 
technically justifiable. And the general objection is the same. A 
term which has had an enormous popular vogue is exploited. 
Scores are reported which have all the appearance of true intelligence 
quotients. In reality, however, they are not quotients in any 
sense whatsoever, but deviation scores, the true basis of which is 
concealed. 


An electric light is to a candle as a motorcycle is to? 
1. bicycle 2. automobile 3. wheels 4. speed 5. police ...... ( ) 

Which one of the words below would come first in the dictionary? 
1. march 2. horse 3. ocean 4. paint 5. elbow 6. night 7. flown ( ) 

The daughter of my brother is my? 
1. sister 2. niece 3. cousin 4. aunt 5. granddaughter ...... ( ) 

One number is wrong in the following series. Which would that 
number be? ...... ( ) 

Which of the five things given below is most like these three: 
Boat, horse, train. 
1. sail 2. row 3. motorcycle 4. move 5. track ...... ( ) 

If Paul is taller than Herbert and Paul is shorter than Robert, 
then Robert is (?) Herbert. 
1. taller than 2. shorter than 3. just as tall as 4. cannot say ( ) 

A wire is to electricity as (?) is to gas. 
1. a flame 2. a spark 3. hot 4. a pipe 5. a stove ...... ( ) 


FIG. 14. INSTANCE OF “SCRAMBLED” ARRANGEMENT OF TEST ITEMS 
(From Otis Self-Administering Tests of Mental Ability) 

In order to show the misleading character of the measure there 
is presented in Table 15 an array of scores on the Advanced Examination 
of the Otis Self-Administering Test, together with their 
equivalent Otis I.Q.’s and Stanford-Binet M.A.’s. In each case there 
is a decided discrepancy. For instance, a score of 72 on the test 
indicates an I.Q. of 130 and a Stanford-Binet M.A. of 19-3. Taking 
the C.A. as 16, which is proper in connection with most subjects for 
whom the Advanced Examination would be used, and computing the 
Stanford-Binet I.Q. on this basis, it works out at approximately 
120. In the same way the score of 12 indicates an Otis I.Q. of 70, 
but the Stanford-Binet M.A. of 10-0 yields an I.Q. of approximately 
62. This latter difference of eight points may not seem so 
very great, but one must remember that it may determine whether 
a subject is tentatively classed as dull-normal or feeble-minded, 
so it can be more important than it looks. 


TABLE 15 

SCORES ON OTIS SELF-ADMINISTERING TESTS OF MENTAL ABILITY, 
ADVANCED EXAMINATION, WITH STANFORD-BINET M.A. 
AND OTIS I.Q. EQUIVALENTS 

(Adapted from Bingham, Table 40, p. 337) 

Otis Scores    Stanford-Binet M.A.'s    Otis I.Q.'s 

    72                 19-3                130 
    66                 18-6                124 
    60                 7-0                 118 
    54                 17-0                112 
    48                 16-3                106 
    42                 15-4                100 
    36                 14-4                 94 
    30                 13-3                 88 
    24                 12-2                 82 
    18                 11-0                 76 
    12                 10-0                 70 
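
The two discrepancies just described can be checked by direct computation. The sketch below assumes, from Table 15, that the age norm for these subjects is a score of 42 (the score shown against an Otis I.Q. of 100), and it takes the C.A. as 16 as the text does. The function names are hypothetical.

```python
def otis_iq(score: int, age_norm: int = 42) -> int:
    """Otis 'I.Q.': 100 plus or minus the deviation from the age norm.

    The norm of 42 is inferred from Table 15, where a score of 42
    corresponds to an Otis I.Q. of 100.
    """
    return 100 + (score - age_norm)


def ratio_iq(ma_years: int, ma_months: int, ca_years: float = 16.0) -> float:
    """True quotient: 100 * mental age / chronological age."""
    mental_age = ma_years + ma_months / 12
    return 100 * mental_age / ca_years


# Score 72, Stanford-Binet M.A. 19-3
print(otis_iq(72), round(ratio_iq(19, 3), 1))  # 130 120.3
# Score 12, Stanford-Binet M.A. 10-0
print(otis_iq(12), round(ratio_iq(10, 0), 1))  # 70 62.5
```

The deviation measure moves one point per raw-score point, whereas the ratio measure depends on the mental age scale, which is why the two diverge at both ends of the table.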


6. Otis Quick-Scoring Mental Ability Tests * 


This is yet another, and in some respects superior, instance of 
a departure from Army testing practices along similar lines. Pub- 
lished at a later date than the test just described it is essentially 
a development of it. The battery has three levels. The Alpha Test 
is for grades 1A to 4. The Beta Test is for grades 4 to 9. The 
Gamma Test is for grades 9 to 16. The battery has four forms, 
two of which are machine scorable. 

Each test consists of 80 items. The content is of the conventional 
kind, consisting of analogies, vocabulary, verbal opposites, dis- 
arranged sentences, reasoning, proverbs, and so forth. The form 
of the items, however, is a distinctive feature. All of them are 
thrown into five-choice multiple choice form. This makes for rapid 
response on the part of the subject and also facilitates the 
scoring. 

As noted above, the test can be obtained arranged for machine 
scoring. But even without this, scoring is very easy. All the re- 
sponses to each test appear on a single sheet, so that a punched 
stencil can be laid over it, with no shifting and no turning of pages. 

The norms for the Beta Test are based on about 16,000 cases. 
The same method of computing intelligence quotients as that 
discussed in connection with the previous battery occurs 
here. 

Distinctive features of this test are its brevity and scorability. 
Yet it is definitely related to the much more extensive American 
Council on Education Psychological Examination for College 
Freshmen. Weber (q.v.) has worked out and published regression 
lines which make it possible to read from the score obtained on 
either one to the probable corresponding score on the other. 

The name of Otis has long been associated with workmanlike, 
very well-constructed, highly acceptable tests that have superior 
features of efficiency and economy. The two batteries just discussed 
are admirable instances. Both of them are easy to score, 
particularly the quick-scoring tests. Also they are easy to administer, 
and take very little time in comparison to some others. 
The testing time required for these two examples ranges from 20 
to 30 minutes. Here a question that naturally suggests itself is 
whether a reasonably valid and dependable score can be obtained 
so rapidly. There is good evidence that it can. The batteries show 
about the same correlations with other intelligence tests as those 
ordinarily expected. Moreover, Bingham (q.v.) has reported a 
correlation of .793 between the Otis Self-Administering Tests, Advanced 
Examination, and the Scholastic Aptitude Test of the 
College Entrance Board, which requires three hours of testing time, 
or at least six times as much as the former. A full set of Scholastic 
Aptitude Test equivalents for Otis scores is shown in Table 16. 
Note that the top Otis score is at the 92nd percentile S.A.T., 
indicating that the Otis Test is too easy for this group. 

* See Otis (1918) for his ideas on test construction. 


7. Pintner General Ability Tests: Verbal Series 


This is a sequential battery, built partly from revisions of 
earlier tests, which appeared in 1939. It is for three levels: that 
for kindergarten to 2nd grade is the older Pintner-Cunningham 
Primary Test; that for 5th to 8th grade is a revision of the Pintner 
Intelligence Test; that from the 9th to the 12th grade is new. The 
battery comes in two forms. 

The Intermediate Test consists of 8 subtests as follows (Form 
A). (1) Vocabulary, which gives items calling for choice between 
five alternatives to match the meaning of the stimulus words. 
(2) Logical selection, to tell what a thing “always has,” e.g., 
forest—snow, trees, beasts, a forester, hunters. (3) Number 
sequence, i.e., choosing a number to finish an incomplete series. 
(4) Best answer, e.g., five reasons for using a knife, the best to be 
chosen. (5) Classification, the items consisting of groups of five 
words with one not belonging with the others. (6) Verbal opposites. 
(7) Analogies. (8) Arithmetical reasoning. 

A highly distinctive feature of this test is that it yields not only 
a “global” or over-all score, but a set of profile scores on the sub- 
tests. This exemplifies a major development in test construction, 
which has become more and more important in the past ten 
years. 
TABLE 16

PREDICTION OF SCORES ON SCHOLASTIC APTITUDE TEST (3 HOURS) FROM
OTIS SELF-ADMINISTERING TEST, HIGHER, FORM A (30 MINUTES)
(Quoted from Bingham, Table 41, p. 339)


Otis  | S.A.T. | Centile | Letter || Otis  | S.A.T. | Centile | Letter
Score | Score  | Rank    | Grade  || Score | Score  | Rank    | Grade
  75  |  647   |   92    |   B    ||  50   |  420   |   21    |   D
  74  |  638   |   91    |   B    ||  49   |  411   |   18    |   D
  73  |  628   |   90    |   B    ||  48   |  402   |   16    |   D
  72  |  619   |   88    |   B    ||  47   |  393   |   14    |   D
  71  |  610   |   86    |   B    ||  46   |  384   |   12    |   D
  70  |  601   |   84    |   B    ||  45   |  375   |   10    |   D
  69  |  592   |   82    |   B    ||  44   |  366   |    8    |   D
  68  |  583   |   79    |   B    ||  43   |  356   |    7    |   D
  67  |  574   |   77    |   B    ||  42   |  347   |    6    |   E
  66  |  565   |   74    |   B    ||  41   |  338   |    5    |   E
  65  |  556   |   71    |   B    ||  40   |  329   |    4    |   E
  64  |  547   |   68    |   C    ||  39   |  320   |    3    |   E
  63  |  538   |   64    |   C    ||  38   |  311   |    2    |   E
  62  |  529   |   61    |   C    ||  37   |  302   |    2    |   E
  61  |  520   |   58    |   C    ||  36   |  293   |    1    |   E
  60  |  511   |   54    |   C    ||  35   |  284   |    1    |   E
  59  |  502   |   50    |   C    ||  34   |  275   |    1    |   E
  58  |  492   |   46    |   C    ||  33   |  266   |   .9    |   E
  57  |  483   |   43    |   C    ||  32   |  257   |   .7    |   E
  56  |  472   |   39    |   C    ||  31   |  248   |   .5    |   E
  55  |  465   |   36    |   C    ||  30   |  239   |   .4    |   E
  54  |  456   |   32    |   C    ||  29   |  230   |   .3    |   E
  53  |  447   |   29    |   D    ||  28   |  220   |   .2    |   E
  52  |  438   |   26    |   D    ||  27   |  211   |   .1    |   E
  51  |  429   |   24    |   D    ||  26   |  202   |   .1    |   E
      |        |         |        ||  25   |  193   |   .1    |   E


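The prediction table appears to follow a straight line. As a rough sketch only, the Python below passes a line through the highest and lowest rows of the right-hand half of Table 16 and compares its predictions with a few intermediate rows. The function names are illustrative, and the endpoint method is an assumption made for simplicity; it is not claimed to be the procedure Bingham used.

```python
# (Otis score, S.A.T. score) pairs read from the right-hand half of Table 16.
TABLE_16_RIGHT = [(50, 420), (45, 375), (40, 329), (34, 275), (28, 220), (25, 193)]


def fit_line_through_endpoints(rows):
    """Slope and intercept of the straight line through the first and last rows."""
    (x_hi, y_hi), (x_lo, y_lo) = rows[0], rows[-1]
    slope = (y_hi - y_lo) / (x_hi - x_lo)
    intercept = y_lo - slope * x_lo
    return slope, intercept


def predict_sat(otis_score, slope, intercept):
    """Predicted S.A.T. score for a given Otis score on the fitted line."""
    return slope * otis_score + intercept


slope, intercept = fit_line_through_endpoints(TABLE_16_RIGHT)
print(f"about {slope:.2f} S.A.T. points per Otis point")
for otis, sat in TABLE_16_RIGHT:
    # Tabled value next to the rounded value from the straight line.
    print(otis, sat, round(predict_sat(otis, slope, intercept)))
```

On these rows the line gives roughly nine S.A.T. points for each Otis point, and the rounded predictions agree with the tabled values.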
8. Kuhlmann-Anderson Intelligence Tests *


This very significant test, which has gone through five revisions 
since its first appearance, marks the most decisive departure so 
far encountered from earlier practice. It has a range of application 
from grades 1 to 12. It comes in nine booklets, each containing 
10 to 12 subtests, the various booklets being for designated age and 
grade levels. Thus each child is always given 10 subtests. There 
are, in all, 39 different subtests in the battery, many recurring at 
different age levels. These subtests involve the use of pictures, 
geometrical figures, mathematics, new associations, and verbal 
relations and information. They were selected from a tentative 
list of 100 possibilities, on the basis of definite increase in 
scores attained at successive age levels. This criterion is much
emphasized by Kuhlmann. A unique feature of the scoring is that 
mental ages are obtained from tables of equivalents for each sub- 
test, and the M.A. of the subject is his median subtest M.A. 
Performance can also be expressed in Heinis’ Mental Units. 
Boynton reports that the test correlates as high with the Herring- 
Binet as the latter does with the Stanford-Binet, and also that the 
test is an excellent one for the identification of unusually bright 
pupils. For references see the test manual which has an extended
discussion of general problems, and also R. G. Anderson. 
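
The median rule for the mental age can be sketched in a few lines of Python. The subtest values below are hypothetical, and mental ages are assumed to be expressed in months; the actual tables of equivalents are in the test manual and are not reproduced here.

```python
from statistics import median


def median_mental_age(subtest_mental_ages_months):
    """Return the median of the subtest mental ages (all in months)."""
    if not subtest_mental_ages_months:
        raise ValueError("at least one subtest mental age is required")
    return median(subtest_mental_ages_months)


# Hypothetical mental-age equivalents (months) for the 10 subtests of one booklet.
subtest_mas = [98, 104, 110, 101, 96, 120, 107, 103, 99, 112]
ma_months = median_mental_age(subtest_mas)
print(f"M.A. = {ma_months} months ({ma_months / 12:.1f} years)")
```

One consequence of using the median rather than a sum or mean is that a single freakishly high or low subtest result leaves the child's M.A. almost untouched.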


TESTS FOR HIGH SCHOOL AND COLLEGE LEVELS


A large number of tests have been developed which differ from 
the foregoing in being intended for the secondary and college 
levels, that is, for more limited age ranges. The distinction is not 
deep-going, and does not involve any new principles or practices, 
and these tests are considered separately merely for the sake of 
clarity of exposition. Later on, when we turn to deal with tests
for young children, and also for adults without reference to
educational placement and status, it will be found that new problems
are indeed encountered.

1. Terman Group Test of Mental Ability


... Army Alpha. It consists of 10 subtests as follows. (1) Information.
(2) Best answers to questions involving the interpretation of
proverbs and matters of fact. (3) Word meanings and opposites.
(4) Logical selection. (5) Arithmetical problems. (6) Sentence
meaning, each sentence containing a concept the understanding
of which determines the response. (7) Analogies. (8) Scrambled
sentences. (9) Classification. (10) Number series completion. It
requires 35 minutes to administer.

* Reference: R. G. Anderson; Garrett, 1941.

It consists, in all, of 185 items in each of its two forms. These
items were selected from an original list of 886, the criterion being 
power to differentiate between persons known to be bright and 
persons known to be dull. Percentile norms are presented for each 
grade from 7 through 12. They are worked out on a standardiza- 
tion group of 41,241 White children, with from 4,000 to 10,000 
at each grade level. This large standardization group was drawn 
chiefly from city schools, two-thirds of them coming from
California. The result is the development of norms which are probably
somewhat high. A table of mental age equivalents for raw scores 
is included in the manual. 
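
The text does not say which statistic was used to judge an item's "power to differentiate" the bright from the dull. As an illustration only, the sketch below assumes the simplest common choice, the difference between the proportions of the two criterion groups passing the item, and keeps the items with the largest difference. All item names and responses are hypothetical.

```python
def discrimination_index(bright_responses, dull_responses):
    """Difference in proportion passing an item between two criterion groups.

    Each argument is a list of 0/1 item scores (1 = correct).
    """
    p_bright = sum(bright_responses) / len(bright_responses)
    p_dull = sum(dull_responses) / len(dull_responses)
    return p_bright - p_dull


def select_items(item_data, n_keep):
    """Keep the n_keep items with the largest discrimination index."""
    ranked = sorted(
        item_data,
        key=lambda name: discrimination_index(*item_data[name]),
        reverse=True,
    )
    return ranked[:n_keep]


# Hypothetical responses: (bright group, dull group) for each trial item.
items = {
    "item_a": ([1, 1, 1, 1, 0], [0, 0, 1, 0, 0]),  # passed by 80% vs 20%
    "item_b": ([1, 1, 1, 1, 1], [1, 1, 1, 1, 0]),  # passed by 100% vs 80%
    "item_c": ([1, 0, 1, 0, 1], [1, 0, 1, 0, 1]),  # passed by 60% vs 60%
}
print(select_items(items, 2))
```

An item such as the hypothetical item_c, passed equally often by both groups, contributes nothing to discrimination and would be among the first discarded.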

This is one of the earliest good group intelligence tests for use 
at the secondary level. It lacks the characteristic “efficiency” 
features which have developed since 1920, when it was published,
but it is easy to administer and easy to score. The use of percentile 
grade norms is open to some objection. However, the test still 
stands up well. For general classification and for the prediction of 
academic success it is a useful instrument. It is highly verbal in 
content, and its relationship to success in trade schools and indus- 
trial schools is not so clear. Moreover, it has some weakness in its 
power to discriminate between the average and the superior. For 
instance, subtest 2 (best answers) is so easy that of 1,146 college 
men and 628 college women 45% and 35%, respectively, made 
perfect scores (Boynton). 


2. Terman-McNemar Test of Mental Ability * 


This test, published in 1941, is a development of the Terman
Group Test of Mental Ability. It consists of seven subtests,
namely, information, synonyms, logical selection, classification,
analogies, opposites, and best answer. In constructing it, the
Terman Group Test was used as a reservoir of items, and enough
more were added for three experimental tests of the same length
as the present one. The three experimental forms were given to
groups of 7th, 9th, and 11th grade pupils, 400 usable cases being
obtained in all. Item difficulties were computed for each grade.


* Reference: Tyler. 


Nondiscriminating items were eliminated. Item validities were
computed. The test was carefully equated to the Terman Group
Test. It is available in two forms, C and D, so designated to be in
sequence with the two forms of the Terman Group Test. Split-half
reliability for grades 7 to 9 is .96. The standard deviation of the
raw scores is 25.69. A table of mental age equivalents for raw scores
is provided. The manual states that I.Q.'s may be computed in the
usual manner, i.e., by dividing mental by chronological age. It
recommends the use of a modified chronological age beyond 13,
with one month of age dropped for each 3 months of life. The user
is cautioned that M.A.’s beyond 16 are scores, and not true
mental ages. Hand- and machine-scoring procedures are provided.
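
The arithmetic of the modified chronological age can be illustrated as follows. This is a sketch under stated assumptions, not the manual's own table: the quotient is multiplied by 100 in the conventional way, the one-month-in-three reduction is applied as a continuous fraction, and no upper limit on chronological age is imposed, since the text above mentions none. The function names are illustrative.

```python
def modified_chronological_age(ca_months, threshold_months=13 * 12):
    """Drop one month for every 3 months lived beyond the threshold age."""
    if ca_months <= threshold_months:
        return ca_months
    months_beyond = ca_months - threshold_months
    return ca_months - months_beyond / 3


def ratio_iq(ma_months, ca_months):
    """Ratio I.Q.: 100 * M.A. / (modified) C.A., both in months."""
    return 100 * ma_months / modified_chronological_age(ca_months)


# A 12-year-old: chronological age is used unchanged.
print(round(ratio_iq(ma_months=168, ca_months=144)))
# A pupil aged 14 years 6 months: 18 months beyond 13, so 6 months are dropped.
print(round(ratio_iq(ma_months=180, ca_months=174)))
```

The effect of the modification is to raise the I.Q. of older pupils somewhat above what the unmodified divisor would give, offsetting the slowing of mental age growth in adolescence.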


3. Thorndike Intelligence Examination for High School 
Graduates 

One practical weakness and limitation of almost all the tests
discussed so far (Army Alpha, the advanced sections of the Otis
batteries, the Terman Group Test) is that they are too easy to be
effective with the abler high school graduates and college students.
When such tests are applied to groups of this kind, a great many
subjects reach the “ceiling,” so that discrimination is not adequate.
The Thorndike Intelligence Examination was one of the first
group tests designed to meet this problem.

It is a lengthy and difficult test, requiring two hours and fifty
minutes of actual testing time. It comes in four booklets. The
first contains practice material, consisting of samples of all types
of items to be used later, for which 15 minutes is allowed. The
second booklet contains subtests involving directions, arithmetical
problems, information, opposites, word meanings, and so forth,
and has a time limit of 45 minutes. The third booklet contains
subtests involving sentence completion, algebra problems, and
information, and has a time limit of 50 minutes. The fourth
booklet consists of questions calling for the interpretation of difficult
prose passages, and has a time limit of one hour.

Thorndike (1920, 1927 b) and Wood (q.v.) describe and discuss
at considerable length the use of this test as part of the admissions
requirements at Columbia University. It was found valuable by
the administrative authorities, and in conjunction with other
criteria predicted college success approximately to a correlation in
the order of .60.
4. American Council on Education Psychological Examination
for College Freshmen * 


This test has largely superseded the preceding, though the latter
still finds some uses. 

Quite apart from the outstanding competence of the work of 
construction, one of the great values of the American Council test 
is that it is revised yearly and has been since 1924, and that each 
year large amounts of data from the several hundred institutions 
where it is given are assembled, collated, and reported. This means 
that it embodies what in the opinion of its authors are the best 
practices currently available for tests of its general character, and 
that obtained scores can be interpreted along various lines with 
much confidence.

An outline of the 1946 edition of the test is presented in Figure
15.


QUANTITATIVE TESTS
(the Q score)
1. Arithmetic
2. Number Series
3. Figure Analysis


LINGUISTIC TESTS
(the L score)
4. Same-Opposite
5. Language Completion
6. Verbal Analogies


FIG. 15. OUTLINE OF AMERICAN COUNCIL ON EDUCATION PSYCHOLOGICAL
EXAMINATION FOR COLLEGE FRESHMEN, 1946 EDITION
(Thurstone and Thurstone, 1947)

* Reference: Thurstone and Thurstone, 1945, 1947.


It comes in three forms, one for hand scoring, another for
machine scoring, and another with a separate answer sheet for
machine scoring. As will be seen from Figure 15, it consists of 6
subtests divided into two parts. It yields two chief subscores, a
“Q” score based on responses to subtests requiring quantitative
insight, and an “L” score based on responses to subtests requiring
language responses. These two subscores, together with the total
score on the entire test, are recommended for purposes of student 
guidance. The scores on the 6 separate subtests should not be so 
used. Here again we have an instance of the profile scoring so im- 
portant at the present day. Thurstone and Thurstone are very em- 
phatic in their warning that the test does not yield either mental 
ages or intelligence quotients. They point out that scores of this 
kind are meaningful only within a certain age range, and that they 
do not apply properly to college students. It is interesting that 
such a caution should be thought necessary for the presumably 
well-instructed persons likely to use and interpret this test. The 
test embodies a great deal of practice material, which takes about 
one-third of the total time. This feature has been adversely 
criticized. 
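
The grouping of the six subtests into the two recommended subscores can be sketched as below. It is assumed here, purely for illustration, that each subscore is the plain sum of its three subtests; the manual may weight or convert the subtest scores differently. The subtest keys follow Figure 15, and the scores are hypothetical.

```python
QUANTITATIVE = ("arithmetic", "number_series", "figure_analysis")
LINGUISTIC = ("same_opposite", "language_completion", "verbal_analogies")


def profile_scores(subtest_scores):
    """Return the Q score, L score, and total from the six subtest scores."""
    q_score = sum(subtest_scores[name] for name in QUANTITATIVE)
    l_score = sum(subtest_scores[name] for name in LINGUISTIC)
    return {"Q": q_score, "L": l_score, "total": q_score + l_score}


# Hypothetical raw scores for one student.
student = {
    "arithmetic": 12,
    "number_series": 18,
    "figure_analysis": 15,
    "same_opposite": 30,
    "language_completion": 22,
    "verbal_analogies": 18,
}
print(profile_scores(student))
```

Only the three values returned here would be reported for guidance; the six separate subtest scores are deliberately not part of the profile.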

Beginning with 1940 an analysis of item difficulty has been set
up, so that the gross scores on successive editions of the test since
then are comparable. Among the many interesting items contained
in the data presented by Thurstone and Thurstone in their 1947
report, it appears that mean scores for all institutions using the
1944 edition range from 128.44 for the highest to 33.42 for the
lowest. Both are four-year colleges. This fantastic range of average
student ability is obviously full of formidable implications for
American higher education.


5. American Council on Education Psychological Examination 
for High School Students * 

This is another test put out at less regular intervals than the 
foregoing, by the American Council on Education. It resembles 
the test for college freshmen in general, but it is easier. The 1937 
edition consists of 4 subtests: (1) Completion, consisting of 55 
items and taking 14 minutes; (2) Arithmetical problems, 20 items, 
20 minutes; (3) Analogies, 29 items, 19 minutes; (4) Opposites,
54 items, 6 minutes.


6. Ohio State University Psychological Test 

Here is another test which has appeared in a series of editions
over a period of years, and regarding which interpretive data have
been systematically accumulated. A brief synopsis of the 1943-44
edition appears in Figure 16. The test is very well known and
widely used. Hartson (q.v.) presents some analysis of its
reliability and validity based on data obtained from its use at Oberlin
College. He finds that its correlation with freshman scholarship
ranges from .429 to .696.

* Reference: Thurstone and Thurstone, 1937.


1. VERBAL OPPOSITES
Thirty stimulus words followed by five other words each, task
being to indicate the opposite of the stimulus word in each case.


2. VERBAL AGREEMENT
Sixty items, each consisting first of two words which establish a
relationship (e.g., both being plurals, noun-adjective
relationship, etc.), then a third word, followed by five words, task being
to indicate which of the five is related to the third as the first
is to the second.


3. PARAGRAPH READING
Nine paragraphs, with four to ten questions on each. Literary,
scientific, and mathematical material is used.


FIG. 16. OUTLINE OF THE OHIO STATE UNIVERSITY PSYCHOLOGICAL
TEST, REVISION 22, 1943-44


One or two comments should be made about tests designed for 
secondary and college levels, among the best of which the above 
six are representative samples.

First, their special orientation obviously makes the problem of 
validation more manageable than it would otherwise be. At the 
same time, one should be on one’s guard against regarding them 
simply as special aptitude tests. The authors of the American 
Council Test specifically maintain that it reveals certain general 
psychological factors; to wit, quantitative ability and language
ability. And it can be argued that the others reveal general in- 
telligence in a special setting and manifesting itself in special 
groups. If this is the case, they cannot be regarded as aptitude 
tests pure and simple. 

Another noteworthy consideration is that such tests are being
developed by nonprofit organizations, such as the American
Council on Education and the Ohio College Association Committee on
Intelligence Tests for College Entrance. This makes possible the
publication of successive improved editions, the systematic
accumulation of interpretive data over long periods of time, and the
embodiment in the instruments of statistical procedures and
psychological conceptions irrespective of marketability, all of which
are very important values indeed. The specific repudiation by
Thurstone and Thurstone of interpretations in terms of mental
ages and intelligence quotients, which are certainly misleading for
the age levels concerned in spite of their popular appeal, is most
refreshing. Also, their construction of a test which yields two
significant subscores instead of relying entirely on the usual global
or over-all score is a development of moment. It is at least a
beginning, whether a successful one or not so far, towards the
building of psychometric instruments capable of genuine
psychological analysis. Such developments are what one might expect
from nonprofit organizations which can commit themselves to the
production of the best possible tests. Unfortunately this hope is
not always realized, for at least one of the major endowed
university presses advertises its tests in the spirit of the vendor of patent
medicines.


PERFORMANCE TESTS AND SCALES 


These are tests in which the tasks set up require the subject to
“do something” rather than to make a verbal response, e.g., to solve
a maze, to assemble a pattern of blocks, to fit cutouts into the
appropriate holes in a form board, to assemble and put together
pictures presented part-wise on pieces of wood or card, perhaps to
select a proper implement for an indicated task, and so on.
Performance items and subtests appeared in the work of Binet from
the very first, and they appear in all the scales discussed in the
preceding chapter, except the CAVD. The Army work, and more
particularly the Army Beta Test, gave a special impulsion to this
development and led to the construction of numerous group tests
and scales consisting entirely of performance items.

The purpose is to get away from the verbalism of such
intelligence tests as have been described, and to avoid its working
limitation of applicability to those who can use English readily.
In many of them language is used to a considerable extent, but
also some are wholly nonlanguage, instructions being conveyed by
demonstration. A performance test often includes items very like
those in a test of mechanical or manual aptitude, but its purpose
is different. Its aim is to measure general mentality in and through
manipulation, rather than ability to manipulate in and of itself.

Five performance scales are presented below in chronological
order, and then one test of a special and unusual kind is described, 
to which a good deal of attention has been given. 


1. Pintner-Paterson Scale of Performance Tests 


This is a pioneer performance scale, published in 1917, and 
abbreviated and revised in 1937. It is for individual administra- 
tion and consists of 15 subtests as follows. 

(1) Mare and Foal Test. This consists of a picture of a farm- 
yard showing among other things a mare and foal, with 11 pieces 
cut out. The task is to assemble the pieces into a picture. It is 
scored by time taken in seconds up to 5 minutes and by the num- 
ber of errors. The same scoring is used in the next 10 subtests. 
(2) Seguin Form Board. This is a board 20 x 1436 inches from
which 10 geometrical shapes have been cut out. These are to be 
fitted into the proper apertures in the board. (3) Five Figure 
Form Board. This is similar to the Seguin Form Board but more 
difficult, because 11 pieces must be fitted into 5 apertures. (4) Two- 
Figure Form Board. Similar to the above but easier. (5) Casuist 
Form Board, considerably harder than the above two, requiring 
12 pieces to be fitted into 4 holes. (6, 7, 8) Somewhat similar form 
boards, with varying numbers of apertures and pieces. (9) Manikin 
Test. This consists of a doll in 6 pieces. The arms, legs, etc., are to 
be fitted into place, but the holes in the body for fitting together 
the various parts differ in shape. (10) Feature Profile Test. This 
consists of 8 pieces out of which to form the profile of features. 
(11) Ship Test. This consists of a ship picture in 10 rectangular
pieces to be fitted together. (12) Picture Completion Test. This 
consists of a picture of a rural scene or scenes with 10 squares cut 
out, the task being to fill in by selecting the most suitable pictures 
out of several. The score originally was the number of blanks 
correctly filled out in 10 minutes, but 5 minutes was found usually 
to be ample time, so the test was restandardized for 5 minutes. 
(13) Substitution Test. This consists of rows of geometrical figures 
to be marked with numbers according to a given key. The score 
is the time required to mark 50. (14) Adaptation Board. A board 
with 4 round holes, three of them 6.8 cm. in diameter, and the 
other 7.0 cm. The subject is shown how one block fits exactly into 
the largest hole, and then told to put it into the right hole, the 
board being placed in four positions. The score is the number of 
right tries. (15) Cube Imitation Test. Consists of 5 black 1-inch 
cubes. Four of them are placed in a row about 2 inches apart in 
front of the subject. The examiner taps the four with the fifth
cube in various orders at a rate of about 1 per second. The task 
of the subject is to watch and then imitate. The score is the num- 
ber of correct tries. 

The scale yields a point score. Percentiles are worked out for 
each level. Tables make possible the computation of mental age 
for each subtest. It is suggested that the median mental age on 
all subtests be considered as a single representative mental age. 

Most of these items are in the nature of stock material for per- 
formance scales and tests, and have been used again and again 
either in the identical form here described or with minor vari- 
ations. An appraisal of them therefore carries far beyond this 
particular instrument.

(a) It is noteworthy that speed is an important factor in 12 out
of the 15 subtests. This raises some question of their power to
discriminate valid intellectual responses or responses calling for
general intelligence. Moreover, very slight differences in speed
affect the score and may determine the difference between age
level classifications. (b) Throughout the scale manipulative
dexterity and the control of small movement are involved. That this
opens the same question once again seems clear. (c) It is quite
true that a systematic method of working will tend to lower the
time taken and to increase speed. Here perhaps is the best reason
for considering these items as significant signs of general
intelligence. (d) In the Cube Test and the Substitution Test (13 and
15) immediate memory span is involved. E. B. Greene (q.v.),
from whom these evaluative comments are derived, points out
that they are based simply on inspection and conjecture, and not
on statistical analysis.

Some further light is thrown upon the meaning and general
validity of this instrument by an investigation reported by
McMurray (q.v.). Fifty children at or above 130 I.Q. on the
Stanford-Binet scale and 50 children with I.Q.’s from 75 to 90 were
compared on the Pintner-Paterson scale. The resulting mental ages
were spuriously high for the bright group and spuriously low for
the dull. Predictions of academic adaptation and achievement
were markedly wrong. It is to be noted, however, that the subjects
in this study were young and that they consisted of children who
deviated markedly from the average. The scale is considered a
valuable supplement to highly verbal scales, but it is not a good
substitute for them. It is particularly useful for the deaf, for those
who do not understand English, and for certain types of
emotionally disturbed subjects. It is not very satisfactory for older
children, and older children who are dull often achieve high ratings
which are probably spurious because of their manipulative facility
(A. W. Brown, 1941; Freeman, 1939).


2. Point Scale of Performance Tests (Arthur) * 


This is also an instrument for individual administration,
intended for ages from 6 years upwards. It is closely similar to the
foregoing. In fact, Arthur restandardized all but three of the
subtests of the Pintner-Paterson scale, the omissions being
numbers 8, 13, and 14. For an account of the processes of
restandardization, see Arthur, 1933; and for interpretations of the scale and
its results, see Arthur, 1930.

Arthur introduced into her scale two subtests which do not 
appear in the instrument discussed above. They are the Porteus 
Maze Test (v. Porteus, 1915, 1924), and the Kohs Block Design 
Test (v. Kohs). The Porteus test consists of 11 mazes of
increasing difficulty. They are to be traced in pencil, which must not
cross any line. If a line is crossed, the maze is withdrawn and a 
duplicate is given for another attempt. For the simpler mazes two 
trials are allowed, but for those intended for the 12- and 15-year- 
old levels four trials are allowed. A success obtained on a trial 
later than the first counts less on the score than a success on the 
first trial with any maze. The test yields a total credit in terms of
mental age. There is no time limit. Speed is not a factor, but 
Porteus believes the task to be indicative of prudent and careful 
foresight and choice. The Kohs Block Design Test consists of 
designs presented on 17 cards to be duplicated with colored cube 
blocks. All the cube blocks are identical, with four of their sides 
colored blue, yellow, red, and white, one side divided horizontally 
between blue and yellow, and one side divided horizontally be- 
tween red and white. From 6 to 16 blocks are needed to duplicate 
the designs as they increase in complexity. The test is scored on 
speed. Both these subtests are very familiar and widely used. They
appear in many performance batteries, and they have been adapted 
and re-edited in many ways. 

The Arthur scale is set up in two forms identical in difficulty.
Form 1 was standardized on 1,125 children ranging from 5 to 15
years old. Mental age norms were worked out with 574 of these
children for whom I.Q.’s had been obtained either on the
Stanford-Binet scale or the Kuhlmann-Binet scale. The criterion used for
the selection and inclusion of subtests was their power to
discriminate between the bright and the dull. Those subtests
discriminating most markedly were given the most weight in
computing the total score. Arthur reports that the Porteus Maze Test
and the Cube Test from the Pintner-Paterson scale showed the
best capacity for discrimination.

* References: Arthur, 1930, 1933; Hilden and Skeels.

The general evaluative comments on this scale are substantially 
the same as those on the Pintner-Paterson scale. Wallin (1946), in 
one of the few studies comparing this test with the Stanford- 
Binet, reports a correlation of .72 with the 1916 Revision, using 
290 cases, and one of .53 with the 1937 Revision, using 172 cases. 
This is an unusually high relationship between tests of the type 
involved.


3. Cornell-Coxe Performance Ability Scale* 


This scale consists of 6 subtests of performance type, with a
seventh as an optional substitute for the third. They are as follows:
(1) Manikin Profile Test, (2) Kohs Block Design Test, (3) Picture
Arrangement Test, (4) Digit-Symbol Test, (5) Memory for
Designs Test, (6) Cube Construction Test, (7) Picture
Completion Test.

The test has been standardized on 306 cases extending
from kindergarten through the eighth grade. This seems a meager
sampling for such an extended range. The authors adopt a curious
and, so far as the present writer is aware, a unique way of
determining mental ages. For them a mental age is neither the median
of the scores of a given age group nor the median of the ages of
those making a given score. Rather, it is a somewhat arbitrarily
determined median between these two values, which makes it
decidedly questionable and ambiguous. In its present form the
scale is at best a possible supplement to other performance scales,
and might be used as a supplement to the more general scales
discussed earlier in this chapter.

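The two customary definitions of mental age mentioned above generally give different answers for the same data, which is why a value taken "between" them is ambiguous. The sketch below computes both from hypothetical (age, score) pairs; it does not attempt to reproduce the Cornell-Coxe compromise itself, whose details are not given here.

```python
from collections import defaultdict
from statistics import median


def median_score_by_age(cases):
    """Definition 1: for each age group, the median score of that group."""
    scores = defaultdict(list)
    for age, score in cases:
        scores[age].append(score)
    return {age: median(vals) for age, vals in sorted(scores.items())}


def median_age_by_score(cases):
    """Definition 2: for each score, the median age of those who made it."""
    ages = defaultdict(list)
    for age, score in cases:
        ages[score].append(age)
    return {score: median(vals) for score, vals in sorted(ages.items())}


# Hypothetical (age in years, raw score) pairs.
cases = [
    (6, 10), (6, 12), (6, 14),
    (7, 12), (7, 14), (7, 16),
    (8, 14), (8, 16), (8, 18),
]
print(median_score_by_age(cases))
print(median_age_by_score(cases))
```

In this made-up sample a raw score of 12 is the median score of the six-year-olds, so the first definition assigns it a mental age of 6, whereas the median age of the children actually scoring 12 is 6.5.
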
The question which evidently arises in connection with
performance scales is whether and to what extent they are valid as
measures of general intelligence. The evidence on this point will
be presented later in another connection. For the moment it may
be said that they show only medium to low correlations with the
customary criteria, such as results on standard intelligence tests,
school achievement, and the like (v. Gaw). As instruments for
independent use their utility is limited. They often have clinical
value, however, and may indicate certain types of mental and
emotional disturbance. Also, they are often useful supplements
for standard instruments for the measurement of intelligence. But
the general view is that at least they embody a significantly
different conception of intelligence from that represented in the
standard instruments previously considered and those to be
discussed later. In view of their emphasis upon speed, manipulative
dexterity, neural control, visual memory, and the like, all these
conclusions, which are supported by considerable statistical
evidence, seem reasonable enough.

* Reference: Cornell and Coxe.


4. Chicago Non-Verbal Examination 


The test comes in one form, and is for ages 7 to adult. It
contains 10 subtests as follows. (1) Digit-symbol, easy learning.
(2) Indicating incongruous objects in a pictorial representation
of a series of objects. (3) Counting the numbers of cubes in
pictured piles. (4) Duplicating a given shape by selecting appropriate
segments pictorially shown. (5) Selecting from a series of designs
one like a given design. (6) Arranging parts of a picture to make
it complete. (7) Arranging a pictured series of events to show its
sequence in time, e.g., series of pictures of catching and losing a
fish. (8) Showing the thing wrong in a series of pictures.
(9) Selecting from a set of pictures the one that goes with a given picture.
(10) Digit-symbol, difficult learning.

Norms were established on EBA HEALG children. Mental age, 
percentile, and standard score norms up to the age of 14 are given. 
Beyond that level there are no M.A.’s. Reliabilities of .80 to .93 
are reported for groups ranging through distributions of 2 and 3 
years C.A. and from 2 to 6 grades. These are probably not high 
enough for intragrade or intra-year comparisons. Four validity 
criteria were adopted—correlation with chronological age, dis- 
crimination of normal from feeble-minded children, normality of 
the distribution of scores, correlation with other tests. On these 
criteria the validity is reported as “reasonably good.” The test 
purports to yield a global measurement of “nonverbal aspects of 
intelligence.” One point to be noted is that the pictorial material, 
so important in this instrument, is often very badly reproduced. 


5. Pintner General Ability Tests: Non-Language Series 


This battery parallels the Pintner General Ability Tests: Verbal 
Series. Its general layout is similar. It assimilates, with consider- 
able modification, the earlier Pintner Non-Language Mental Test, 
which was, up to the time of its publication, one of the most
comprehensive and excellent of nonlanguage tests. A synoptic outline
of the Intermediate Test is shown in Figure 17. 

Pintner’s (1924) discussion of the earlier test probably applies 
quite well here also. Correlations with verbal tests are not high, 
running from .25 to .72. Correlation with a criterion of intelligence 
constituted by a composite of chronological age, school marks, 
teachers’ estimates, school progress, and four other intelligence 
tests was .78 for 235 children in grades 2 to 4. Correlation with 
teachers’ estimates alone was about 0. 


1. FIGURE DRAWING 
A geometrical figure is shown complete and then cut up. Task 
is to choose the one of 5 lines that would cut the complete figure 
as shown. 

2. REVERSE DRAWINGS 
A series of items each consisting of a geometrical figure shown 
complete and then reversed and with one line missing. Task is 
to choose the one of 4 lines which is the one missing. 

3. PATTERN SYNTHESIS 
Items consisting of 2 geometrical figures identical in outline but 
with different internal segments shaded. Task is to imagine the 
first superimposed on the second and to tell which of 4 designs 
it would then resemble. 

4. MOVEMENT SEQUENCES 
Items each consisting of three figures with movement in given 
direction indicated by arrows. Task is to tell where the figure 
would be if the movement continued beyond the point shown 
in the third figure by choosing from 4 diagrams. 

5. MANIKIN 
Manikin figure shown in various positions and postures. The 
position of the arms in the first figure in each item to be matched 
from 4 others showing manikin upside down, etc. 

6. PAPER FOLDING 
Drawings of sheets of paper folded in various ways with small 
segments cut out. Task is to tell what each sheet would look like 
unfolded by choosing among 4 drawings. 

FIG. 17. PINTNER GENERAL ABILITY TESTS: NON-LANGUAGE SERIES; 
INTERMEDIATE LEVEL. SYNOPTIC OUTLINE 



6. Drawing a Man * 

This test, intended for ages from 3½ to 13½ years, has attracted 
a good deal of attention because of its novelty. Goodenough 
concluded from her own work and from the investigations of others 
that drawing can be an indication of intelligence. This is the idea 
embodied in the present test. Instructions to the subject are as 
follows: “On these papers I want you to make a picture of a man. 
Make the very best picture that you can. Take your time and 
work very carefully. I want to see whether the boys and girls in 
. . . School can do as well as those in other schools. Try very 
hard and see what good pictures you can make.” The scoring 
depends on the presence of certain items, such as legs, attached 
legs, nose, fingers, etc., and not on art quality. The figure of a 
man was chosen because of its familiarity. Proportion and 
perspective as well as enumerated parts are credited in the scoring. 
The points emphasized in scoring were chosen because they show 
a regular increase with age, and differentiate between children 
of the same age but in different school grades. The total possible 
score is 51. Instances of means are as follows: for C.A. 3½, score 
of 2; for C.A. 4½, score of 6; for C.A. 5½, score of 10; for C.A. 
13½, score of 42. Obtained reliabilities run from .77 to .93, but 
the scoring seems decidedly subjective. McCarthy (q.v.), in a study 
of the reliability of this test, gave it twice at a one week interval 
to 386 3rd and 4th grade children. Each test was scored three 
times, twice by the same scorer, and once by another. The 
correlation of scorings by the same person was .94, and 12.4% of the 
cases yielded a difference of one year or more M.A. Scorings by 
different persons correlated .90, but 25.3% of cases differed one 
year or more. An odd-even reliability of .89 was obtained. In 
general the test does not correlate highly with other intelligence tests. In 
one study (McHugh, 1945) correlations of .45 ± .06 on M.A.’s and 
.41 ± .06 for I.Q.’s were obtained with the Stanford-Binet 1937. 
These coefficients are higher than others reported. There is some 
reason to believe that it is definitely affected by environmental 
influences, for Indian children, particularly boys, from 6 to 11 C.A. 
proved decidedly superior to whites, which is thought to be due 
to emphasis on visual values in the upbringing of Indian children 
(Havighurst, Gunther, and Pratt). 

* Reference: Goodenough, 1926. 
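
The odd-even reliability mentioned above can be illustrated with a short sketch. It assumes that each child's drawing has been scored as a list of 0/1 credits for the scoring points, splits the points into odd and even halves, correlates the two half scores, and then applies the Spearman-Brown step-up. Whether the reported .89 was a stepped-up or an uncorrected coefficient is not stated in the text, so both values are returned. The function names and the tiny data set are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 scores)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(item_credits):
    """Return (half-test r, Spearman-Brown estimate) for 0/1 item credits.

    item_credits holds one list per child; position 0 is scoring point 1,
    so the slice [0::2] collects the odd-numbered points.
    """
    odd_scores = [sum(row[0::2]) for row in item_credits]
    even_scores = [sum(row[1::2]) for row in item_credits]
    r_half = pearson_r(odd_scores, even_scores)
    stepped_up = 2 * r_half / (1 + r_half)
    return r_half, stepped_up


# Hypothetical credits for four children on four scoring points.
credits = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]
r_half, r_full = odd_even_reliability(credits)
print(round(r_half, 3), round(r_full, 3))
```

The same `pearson_r` helper could be applied to two scorers' totals for the same drawings to obtain a scorer-agreement coefficient of the kind McCarthy reports.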

From the examples here presented, which, although few in 
number, are representative of the best work of its kind, it appears 
that the attempt to follow the lead set by the Army workers in the 
construction of Beta, and to build group intelligence tests of 
performance type, is much less happy and successful than in the 
case of verbal group intelligence tests. The items are often forced 
and trivial and of dubious relationship to intelligence, however 
ingenious they may be. For instance, the imaginative fitting 
together of separate geometrical figures is not apt to occur in real 
life. Digit-symbol substitutions, which are frequently used, may be 
suitable for code workers, but have little relevance to the doings of 
most people. Also, it rarely happens that one is called upon to 
detect incongruities in pictured scenes. It may well be, as Porteus 
(1924) argues, that maze tests, which are also used, though they do 
not occur in the above examples, can indicate continuous 
adaptation and planning and self-criticism, which are among the 
recognized attributes of intelligence. But as Porteus also remarks, 
these values are nullified when, as often happens, maze tests are 
run under a time limit. 

So the tests of the kind under consideration are mostly far afield 
from the descriptions of intelligence advanced by Stoddard and 
Boynton; and since item difficulty is not much considered in their 
organization, they are also unrelated to Thorndike’s description, 
with its emphasis upon altitude. Certainly they do not coincide 
closely with the two former descriptions, which make much of 
the importance and relevance of motivation, and of the intrinsic 
importance of the tasks intended to reveal intelligence. As a 
secondary but not unimportant point, it must be said that the 
pictorial material so largely used is often of atrocious quality, so 
that one cannot really tell what is represented, and all sorts of 
preposterous interpretations are suggested. Moreover, they are 
“performance” tests only by courtesy, for there are few 
paper-and-pencil tasks which are really direct “doings.” Making 
decisions about pictures of piles of blocks is not a parallel for 
actually handling the blocks, for instance. 


SUGGESTED ADDITIONAL READINGS 


For additional reading and more intensive study of the material in 
this chapter the most important sources are the tests and more 
particularly the manuals of the tests discussed. Publishers will be found 
in the bibliography of tests at the end of the book. Also the 
references mentioned in the text in connection with th 


+ 1936), Chapter 1 
, interpretation of results, and Syn- 


Y Valuable general chapter. 
Paul L. Boynton, Intelligence, its manifestatsi 
ment (New York: D. Appleton-Cen 


Edward B. G 


Ethel L. Cornell and Warren W. Coxe, A Performance ability scale 
VYonkers-on-Hudson, N. Y.: World Boo! 


Ese two reference 
and comment in regard to many 


QUESTIONS FOR DISCUSSION 


for intelligence? 
2. Bring together all the evidence on 


nce might be embodied in 
Verbal and in Performance type tests? Would the difference be a 


do with the Sort of items he 


vould You to expect fairly high correlations 
among verbal group intelligence tests? 

8. If differences between the results of such tests are largely a 
matter of the standardization groups, can you see any way of 
avoiding them? 

9. If someone asked you to recommend an intelligence test for 
use in a school, in personnel work, or elsewhere, what considerations 
would you have in mind in choosing one? 

10. In what respects and for what reasons might one expect better 
tests from nonprofit organizations than from commercial publishers? 


CHAPTER VI 


TESTS OF INTELLIGENCE (6100) 


INTRODUCTION 


Particularly true of group tests, 
for the application of instruments of measurement to the very 
young virtually requires individual administration. Some of the 


Wechsler-Bellevue scale extends upwards to the age of 60. Since 
these instruments have already been discussed under other head- 
ings, they will not be mentioned again here, except incidentally. 
We shall deal with tests specifically designed for young children, 
and specifically designed for adults. Tests developed in World 
War II are of particular importance in the latter category, 
although there have also been a few civilian tests of this kind. 


1. Minnesota Preschool Scale 


The Minnesota Preschool Scale is designed for ages from 1 
to 6 years. It comes in two forms. A synoptic outline of the scale 
is shown in Figure 18. 

As will be seen, it consists of 26 items, 
to those contained in the Revised Stanford- 
parable age levels. It was standardized on 0 
Social and economic status 


that can be managed with a correctness of 50% (Kelley, 1916). 
The difficulty steps are rated in terms of the standard deviation 


1. Pointing out parts of the body 
Showing parts of the body on doll. 

2. Pointing out objects in pictures 
Show a chair, etc. 

3. Naming familiar objects 
Five actual objects presented for naming. 

4. Copying drawings 
Circle, triangle, diamond. 

5. Imitative drawing 
Experimenter draws lines, designs; child imitates. 

6. Block building 
Twelve cubes to be built into various designs, copying 
experimenter. 

7. Response to pictures 
Three pictures, task to tell what they are about. 

8. Knox cube imitation 
Four cubes nailed to base, one loose. Child to imitate various 
manipulations. 

9. Obeying simple commands 
Handling and manipulating various objects on instruction. 

10. Comprehension 
Telling what to do in various simple situations. 

11. Discrimination of forms 
Match form of actual objects from cards with pictured designs. 

12. Naming objects from memory 
A set of objects is shown and then covered. One is taken away, 
and the set is shown again, child being asked to tell which is 
gone. 

13. Recognition of forms 
A picture is shown briefly, then removed, and the child is asked 
to match it from a card offering various choices. 

14. Colors 
Naming colors as presented. 

15. Tracing a form 
Following forms with a pencil. 

16. Puzzle Series: Rectangular Series 
Pictures dismembered in rectangular directions to be 
reassembled. 

17. Incomplete pictures 
Indicating omissions from pictures of simple objects. 

18. Digit span 
Repeating series of digits given orally. 

19. Picture Puzzles: Diagonal Series 
Like 16 above, but diagonally disassembled. 

20. Paper folding 
Experimenter folds a sheet, child to do the same. 

21. Absurdities 
Absurd sentences presented orally. 

22. Mutilated pictures 
Pictures with “something wrong” in them. 

23. Vocabulary 
List of words to be explained. 

24. Giving word opposites 
List of words, task being to give the opposite of each. 

25. Imitating position of clock hands 
Cardboard clock, child to imitate hands variously placed by 
holding out arms. 

26. Speech 
General record of any sentence of five words or more spoken 
by the child is credited on his score. 

FIG. 18. MINNESOTA PRESCHOOL SCALE. 
SYNOPTIC OUTLINE 


of the score on the ass 
being to obtain scorin 
were based upon a group of 631 children from varied environments, 
care being taken to avoid a preponderance of those enrolled 
in or on the waiting list of private schools. 

The subtests were selected on the following criteria. (a) 
Attractiveness to children. (b) Sufficient variety to sample a wide range 
of abilities. (c) Significant and marked differentiation of difficulty 
with age, so that 4 months’ difference in mean age was readily 
indicated. (d) Avoidance so far as possible of influence by training 
and environment. (e) Attainment of an equal spacing of steps 
or units of difficulty. (f) Universality of experience. (g) Ease of 
administration. The subtests were located in the age subdivision 
at which 50% of the group were able to pass them. 


EIGHTEEN TO TWENTY-FOUR MONTHS 

2.* Throwing Ball 
Child is given tennis ball and told to throw it to person 
giving test. 

3.* Straight Tower 
Child is told to copy experimenter who builds tower out of 
scattered blocks. 

9. Repetition of Words 
Experimenter asks child to say “kittie,” then presents three 
other words all at same time. 

11.* Folding Paper 
Child watches while experimenter folds sheet double and 
opens it out, then asked to do the same, i.e., to “make a 
little book.” 

TWENTY-FOUR TO TWENTY-NINE MONTHS 

13.* Identification of Self in Mirror 
Child shown own reflection in mirror and asked to tell by 
name who it is. 

17. Drawing up String 
Child watches while experimenter pulls up a stick by a string 
tied to it, then asked to do the same. 

10.* Questions 
Child is asked ten very simple questions. 

THIRTY TO THIRTY-FIVE MONTHS 

23.* Matching Colors 
Putting capsules colored red, blue, green, yellow into boxes 
of same color. 

31. Seguin Form Board 
Fitting the inserts into the ten spaces of the Seguin Form 
Board. Should take 222 seconds or less. 

32.* Repetition of Word Groups 
Similar to repetition of words above. 

THIRTY-SIX TO FORTY-ONE MONTHS 

37.* Copying a Circle 
Child shown a card carrying circle 1 inch diameter and asked 
to make one like it, using a pencil. 

46. Action Agent 
Child is asked “What sleeps?”, “What cuts?”, etc., and scores 
by indicating either the agent or the object of the activity. 

FORTY-TWO TO FORTY-SEVEN MONTHS 

49. Seguin Form Board 
As above, but timed to 72 seconds or less. 

56.* Copying Cross 
Procedure similar to copying circle above. 

57. Mare and Foal 
The various pieces of the puzzle completion board to be 
fitted together to make the picture of the mare and foal 
which is presented. 

FORTY-EIGHT TO FIFTY-THREE MONTHS 

60. Seguin Form Board 
As above, but timed to 63 seconds. 

67. Mare and Foal 
As above, but faster timing. 

69. Four Buttons 
Buttoning four buttons attached to strips of cloth into 
buttonholes. 

FIFTY-FOUR TO FIFTY-NINE MONTHS 

82. Copying Star 
Similar to circle and cross above. 

SIXTY TO SIXTY-FIVE MONTHS 
Tests here are similar to above, but with higher norms. 

SIXTY-SIX TO SEVENTY-ONE MONTHS 
Tests here are similar to above, but with higher norms. 

* Numerals are the serial numbers of the subtests in the scale. 
Asterisks indicate tests which do not appear at higher levels. 

FIG. 19. SAMPLE SUBTESTS FROM THE MERRILL-PALMER SCALE 
(Stutsman) 


As will be seen, the scale is divided into age levels at steps of 
6 months. If a child passes half or more of the subtests at any 
level, he goes on to the next. One point of credit is given for each 
subtest passed. Omissions and refusals are credited as successes 
when the number ascribed to the subtest in question is below the 
child’s total number of successes, and otherwise as failures. Part 
of the reason for this is to avoid an undue effect from the 
negativism often shown by young children confronted by this or that 
item, which is a disturbing and falsifying influence in 
measurement at early stages. 
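
The crediting rule for omissions and refusals can be sketched in a few lines. This is only one reading of the rule as worded above: it takes the child's "total number of successes" to be the count of subtests actually passed, and applies the comparison once rather than repeatedly. The function name and the result labels are hypothetical, and the manual's own procedure may differ in detail.

```python
def raw_score_with_refusals(results):
    """Total credits under the omission/refusal rule described in the text.

    results maps a subtest's serial number to one of
    "pass", "fail", "omit", or "refuse".
    Assumption: an omitted or refused subtest earns credit when its
    serial number is below the number of subtests actually passed.
    """
    passes = sum(1 for outcome in results.values() if outcome == "pass")
    credited = sum(
        1
        for serial, outcome in results.items()
        if outcome in ("omit", "refuse") and serial < passes
    )
    return passes + credited


# Hypothetical record: three passes, one early refusal, one late omission.
record = {1: "pass", 2: "refuse", 3: "pass", 4: "pass", 5: "fail", 6: "omit"}
print(raw_score_with_refusals(record))
```

In this record the refusal on subtest 2 is credited because 2 is below the three passes, while the omission on subtest 6 counts as a failure.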

The scale yields three types of scores as follows. (a) Mental 
ages, norms for which are given in conversion tables showing raw 
score equivalents. Thus a raw score of 47 indicates a mental age 
of 39 months. (b) Standard scores, for which again conversion 
tables are given in the manual. Thus a raw score of 47 is at the 
mean for 39 months, 2 S.D. above the mean for 31 months, and 
f S.D. below the mean for 55 months. (c) Percentile scores. Thus 
47 is at the median or 50th percentile for 39 months, and at the 
95th percentile for 32 months. It should be noted that the scale 
does not yield intelligence quotients, for a reason to be presented 
below. 
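
The relation between the standard scores and the percentile scores just described can be illustrated with a minimal sketch. The manual supplies conversion tables; the version below instead computes a standard score directly from an age-group mean and standard deviation, and converts it to a percentile under the assumption of a normal distribution, which the text does not state. The mean of 47 at 39 months comes from the example above; the standard deviation of 8 is purely hypothetical.

```python
from math import erf, sqrt


def standard_score(raw, mean, sd):
    """Distance of a raw score from the age-group mean, in S.D. units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd


def normal_percentile(z):
    """Percentile rank of a standard score, assuming a normal distribution."""
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Hypothetical norms: age in months -> (mean raw score, S.D.).
# Only the mean of 47 at 39 months is taken from the text.
HYPOTHETICAL_NORMS = {39: (47.0, 8.0)}

z = standard_score(47, *HYPOTHETICAL_NORMS[39])
print(z, normal_percentile(z))
```

Under the normality assumption a score at the 95th percentile lies about 1.645 S.D. above the mean of its age group, which is the kind of correspondence the conversion tables encode.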

Of the subtests, 16 are scored all-or-none, and 22 have variable 
score values. The subtests are not massed according to the 
functions measured. Thus the instrument conforms to the Binet 
composite pattern and is committed to the idea of averaging a number 
of undefined abilities. 

With regard to validation the following positive evidence is 
presented. (a) The scale differentiates well between children rated 
on intelligence by the staff of the Merrill-Palmer School. (b) Total 
on the scale correlate with chronological age ‘925 he. 
a ton of 793 == 192, 702-0495 2nd 1-783 — ES 

e Stanford-Binet scale are reported. 

These correlations with the Stanford-Binet scale have been 
seriously questioned. Wellman (1938) tested 2gr children from 
20 to 62 months old with the Merrill-Palmer scale, deriving 
percentile scores, standard scores, and intelligence quotients. On 
retesting somewhat later she found frequent gains, particularly 
among the superior children. In particular, she found quite a low 
correlation with the Stanford-Binet scale given at about the same 
time. DeForest (q.v.) also reports low correlations between the two 
scales, and finds that they tend to decrease at the age at which 


mental ages at various chronological ages is not stable. Thus an 
I.Q. which would be 2.5 S.D. above the mean throughout would 
vary all the way from 122 to 


above the mean at all ages wo 
1.5 S.D. above the mean at all 
and an I.Q. 2.5 S.D. below the 


ary and appraisal, the following points 
instrument is an excellent supplement 


» and postural behavior 


gh they are prominent in the 
reactions of children. Drawing, again, is significant enough to 


deserve inclusion in more than three of the subtests. Language 
could be more emphasized with probable advantage, the chief 
subtest recognizing it being the Action Agent, probably the best 
subtest in the scale. (c) It makes more of the speed factor than 
could be wished. 


3. The California Preschool Mental Scale * 


This is another instrument for individual administration, 
intended for ages from 1 to 6. In content it is in general similar 
to the two already mentioned. The subtests fall into 10 categories. 
(1) Manual facility. (2) Block building. (3) Drawing. (4) Form 
discrimination. (5) Spatial relationships discrimination. (6) Size 
and number discrimination. (7) Language comprehension. (8) 
Language facility. (9) Immediate recall. (10) Completions. There 
are one or more subtests of each type at each age level. This is 
done in an attempt to make the scale homogeneous throughout 
the entire age range and constitutes one of its distinctive features. 

It yields three types of scores, to wit, approximate mental ages 
and resulting intelligence quotients, standard deviation scores, and 
profiles based on the various types of tests. The scoring has been 
found to be rather difficult, and the criticism has been made that 
the manual does not sufficiently describe the responses which are 
to be considered satisfactory in view of the very varied and fluid 
reactions of young children. Judging from the size of the 
standardization group the norms should have an adequate foundation, but 
complete details are not given. 

* Reference: Jaffa. 


4. California First-Year Scale † 

This scale, for individual administration, is intended for infants 
from 1 to 12 months old. The items and separate subtests of this 
scale are not dissimilar in general type from those usually found. 
A distinctive feature of its construction is that it was based upon 
a sequential study of the same group of 61 infants who were 
tested at monthly intervals from 1 to 15 months, and then at 18 
and 21 months. It was not always possible to secure all the 
members of this group for each testing in the sequence, but there were 
never less than 46. Compared to the conventional standardization 
group this one, of course, is very small. But the principle of 
following up the same children over a considerable period of time 
is an excellent one. It should be noted that the children involved 
were a rather highly selected group. 

The scoring is either in absolute scale units, or in standard 
deviation scores, or in mental age units. The last named practice 
is not recommended. In general the scale is regarded as an 
excellent practical instrument. 


† Reference: Bayley, 1933. 


5. Iowa Tests for Young Children ‡ 

This instrument, outcome of twelve years’ work, is for children 
4 months to 2 years old. It has 48 subtests, as shown in the 
synoptic outline in Figure 20. 

‡ Reference: Fillmore. 


1. Sitting on lap unsupported 
2. Accepting second cube offered 
3. Taking cup off hidden object 
4. Attempting to stand 
5. Reacting to image in mirror 
6. Locating sudden sound 
7. Carrying ring to mouth 
8. Examining object 
9. Looking on floor for fallen object 
10. Sitting on table or floor unsupported 
11. Attempting to ring bell 
12. Showing interest in picture 
13. Poking at pellet 
14. Ringing bell 
15. Raising self by chair 
16. Hunting covered object 
17. Picking up pellet 
18. Walking with help 
19. Trying to put cork in bottle 
20. Trying to put penny in bank 
21. Marking with pencil 
22. Accepting third cube 
23. Placing cubes in box, score 3 
24. Trying to put sand in jar 
25. Putting cork in bottle 
26. Putting 2 boxes of nest together 
27. Putting penny in bank 
28. Piling blocks 
29. Placing pegs in board 
30. Placing cubes in box 
31. Putting 3 nest boxes together 
32. Throwing ball to examiner 
33. Putting sand in jar, awkwardly 
34. Pointing to object in picture 
35. Putting key in padlock 
36. Cubes in box 
37. Rolling ball to examiner 
38. Pointing to features 
39. Unscrewing jar lid 
40. Skeels form board 
41. Cubes in box 
42. Skeels form board 
43. Naming objects in picture 
44. Putting sand in jar 
45. Drawing circle, hand guided 
46. Putting all nest boxes together 
47. Matching boxes and covers 
48. Skeels form board 

FIG. 20. IOWA TESTS FOR YOUNG CHILDREN. 
SYNOPTIC OUTLINE 


feature of this scale, as may be gathered from the synoptic 


list of items presented, is its av 
material of a personal and soci 


degree between those who are 
4s 4 whole. This is perhaps th 
author is able to present, an 

Yet another feature of st 
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method used for determining the age placement and order of the 
items. The procedure, based on assigning them to age groups in 
which they were passed by 50% to 60% of the subjects, was found 
unsatisfactory, and so they were placed “at par” according to the 
method developed by Thurstone (1925). The formula by which 
this was done was as follows: 

$$ y + \frac{50 - p_1}{p_2 - p_1} $$

y = age at which below 50% correct answers occur 
p_1 = percentage passing item at age y 
p_2 = percentage passing at age y + 
(one class interval of age) 
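
The definitions of y, p1, and p2 given above suggest a straight-line interpolation between two neighboring age groups to find the age at which exactly 50% pass an item. The sketch below works on that assumption only; it is not offered as Thurstone's own procedure or as the exact formula of the Iowa scale. The function name and the `interval` parameter (the width of one age class, in the same units as y) are hypothetical.

```python
def age_at_par(y, p1, p2, interval=1.0):
    """Interpolated age at which 50% of children would pass an item.

    y        -- age at which fewer than 50% pass (lower age group)
    p1       -- percentage passing at age y
    p2       -- percentage passing at age y + interval
    interval -- width of one age class, in the same units as y (assumed)
    """
    if not (p1 < 50.0 <= p2):
        raise ValueError("50% must fall between p1 and p2")
    return y + interval * (50.0 - p1) / (p2 - p1)


# Hypothetical item: 40% pass at 24 months, 60% pass at 30 months.
print(age_at_par(24, 40.0, 60.0, interval=6))
```

With these invented figures the item would be placed at 27 months, halfway between the two age groups, because 50% lies halfway between 40% and 60%.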


6. Motor Achievement Test * 
This test has a somewhat different purpose and is of a somewhat 
different type from the foregoing instances. It calls for 
performance in 4 categories of tasks. (1) Ball activities. Bouncing 
and throwing a ball in various ways. One type of reaction is for 
the child to bounce and throw the ball with a location field marked 
on the floor. He stands on the edge of it, and throws or bounces 
the ball to the examiner. Evaluation depends on his success in 
staying within the indicated zone, on distance, and on his use of 
one or both hands. Another type of reaction is the catching of 
the ball when thrown to him by the examiner at the level of his 
chest. Two balls are used, one of 9½ inches and the other of 
oe inches circumference. Three trials are allowed for each 
performance. (2) Hopping, skipping, walking. The consideration here 
is ability to maintain equilibrium. The test calls for walking on 
a path and a circle, the path being 10 feet long and 1 inch wide, 
the circle being 4 feet in diameter; and scoring depends on the 
number of times the child goes off the indicated track, three tries 
being allowed. As to hopping, this calls for hopping with Sd 
e chi 

imitates the examiner. umping. The child is required to jump 
BS: boxes of four HE 8 inches, 12 inches, 18 inches, and 
4 Inches. He is given three tries at each height, a 


» 
Reference: McCaskill and Wellman. 
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6 inches apart, another with 6 rungs 12 inches apart, both being 
at 45 degrees. 

The test is designed to yield motor ages. These are determined 
by the age levels at which 50% of the tasks are satisfactorily 
passed. There are separate scores for the four categories above. 
These scores show fairly high intercorrelations, but they would 
probably be much lower if age were held constant. A retest 
correlation of .98 is reported on repetition of the test after one week. 

Some light on the general significance of this instrument and 
the performances it elicits and endeavors to appraise may be 
gained from Bayley’s study of the relationship of motor and 
mental development in young children (Bayley, 1934). She finds 
that these two broad categories of behavior have different growth 
patterns. Motor development in the first two years is more marked 
than mental development, according to her reports, after which 
it slows up. Little is known about the predictive value for later 


behavior of motor control at early ages, or about its relationship 
to and predictive value for mentality. 


7. The Developmental Examination 


An approach to the 
tality, and developme 
different from those so 
of Gesell, and more particularly 
da which pre- 
sents the latest and most compr 
ehensive account, and also Gesell 
and Thompson, 1938. 

The developmental examination is a codification of the diverse 
and complex phenomena of human development into a series of 
schedules. These schedules indicate the behavior patterns to be 
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expected for every 4 weeks of life from birth to 12 months, and 
thereafter for every 3 months up to 42 months. Two of these 
schedules are summarized in order that their general drift and 
significance may be understood. 

At twelve weeks the child, when supine, is found with head 
predominantly half side, less fully rotated than when younger. 
Chin and nose are in line with median line of trunk. Arms are 
symmetrically disposed. When sitting, his head is set forward, 
erect, but bobbing and unsteady, somewhat thrust forward. Prone, 
he rests on forearms with arms flexed, weight resting on elbows 
and forearms. When presented with an inverted cup which is part 
of the testing material, he regards it more than momentarily and 
contacts it. So also with the cube which, too, is part of the testing 
equipment. When a ring dangling from a string, which again is 
used in the testing, is presented to him he follows it visually for 
as much as 180 degrees. His vocalization is marked by chuckles 
just short of laughter. In his social-vocal response he vocalizes 
in some manner, or “talks back” in response to social-vocal 
stimulation. In his spontaneous play, he brings one or both hands 
before his face for regard. 

At thirty months the child walks on tiptoe, jumps with both 
feet, tries to stand on one foot when encouraged to do so and 
shown how by the examiner, holds a crayon in fingers instead of 
in fist. He can build a tower of eight blocks which should be well 
enough constructed to stand alone. (Blocks also are part of the 
testing material.) He uses cubes to add a chimney to a train 
made out of cubes, when the examiner asks where the chimney is. 
He draws two or more strokes for a cross. He is able to place 
correctly one of the color forms presented. He places three 
cutouts in the form board, and adapts, although with error, when 
the board is rotated, thus changing its position relative to him. 
He can give his full name. He can indicate the uses of the test 
objects, such as keys, and so forth. In communication he refers 
to himself by pronoun rather than by name. He shows 
repetitiveness in speech and other activities. 

These schedules are the basis of a definite examination 
technique, and rating blanks have been prepared for recording the 
behavior interview. Taken in conjunction with a proper 
interpretation of the total clinical evidence, such an examination 
offers highly significant and pertinent data. Gesell reports that 
such a comprehensive developmental examination competently 
interpreted is of high predictive value, and indicates later 
behavioral and mental status with a high degree of probability 
(see Gesell, 1940). 

The developmental schedules are available in separate printed 
sheets, suitable for convenient use. 


EVALUATION OF INFANT TESTS 


In view of the great present interest in the possibility that 
environmental influences may greatly affect early growth, an 
informed and judicious opinion as to the values and limitations 
of the scales and tests by which infant mentality is determined 
is highly desirable. 

1. Clearly there are two approaches, related, no doubt, but in 
many essential respects very different, to the evaluation of infant 
behavior and development. The first is by instruments of measure- 
ment of the accepted and usual kind, centering about a loosely 
defined conception of what is to be measured, translating it into 
an array of items more or less relevant to it, standardized with 
reference to a sample group, and yielding various kinds of scores. 
The other, represented by the Developmental Examination, starts 
with direct studies of behavioral growth and its phenomena, and 


endeavors to chart it by characterizing it at various levels and by 
calling attention to various objective indications which may or 
may not resemble the responses elicited by conventional test items. 
Since much of the entire debate about the effect of early environ- 


ment upon mentality has turned upon the application of instru- 
ments of the former type, it is the value and meaning of these 
instruments which are of first concern here. 

2. It has been shown many times that the prognostic value of 
scales of the former type, such as the Merrill-Palmer Scale, the 
Minnesota Preschool Scale, the California Preschool Scale and 
the like, when judged by later test performance, is not very high. 
A good and typical example of the outcomes of studies on this 
problem is shown in Table 17. The data reported are derived from 
the sequential testing of the same group of 61 children in 
connection with which the California First-Year Scale was set up 
(Bayley, 1933). In all, 49 of these children were continuously 
available throughout a three-year period. They were given a wide 
variety of tests, out of which the items for the scale just 
mentioned were selected. As the table shows, scores on such tests have 
a reasonably high predictive value over a short period of time. 
For instance, scores made during the 7th, 8th, and 9th months 
correlate as high as .81 with scores made during the next three 
months, namely the 10th, 11th, and 12th. But these former scores 
correlate only .22 with scores achieved during the period of the 
27th, 30th, and 36th month. 


TABLE 17 

CORRELATIONS OF SCORES OBTAINED DURING CERTAIN PERIODS WITH 
SCORES OBTAINED DURING LATER PERIODS 

(Bayley, 1933 a, Table 9, p. 47) 

AVERAGE SCORES      CORRELATIONS WITH AVERAGE SCORES FOR SUBSEQUENT 
FOR THREE-MONTH     THREE-MONTH PERIODS (MONTHS) 
PERIODS (MONTHS)    4, 5, 6   7, 8, 9   10, 11, 12   13, 14, 15   18, 21, 24   27, 30, 36 

1, 2, 3           (illegible)   .42        .28          .10         -.04         -.09 
4, 5, 6                         .72        .52          .50          .23          .10 
7, 8, 9                                    .81          .67          .39          .22 
10, 11, 12                                              .81          .60          .45 
13, 14, 15                                                           .70          .54 
18, 21, 24                                                                        .80 


The significant thing to notice in 
this table is the steady drop in indicated relationship as the time 
between the initial testing and later testings increases. In the same 


Way, as is shown in Table 18, c | 
1 early age shows more and more instability as the time interval 
efore retesting becomes longer. It will b 


nitude of the mean changes, Doth positive and negative, 


marked increase as the time interval lengthens. The reader, of 


course, should understand that the intelligence q 
reported were obtained at very early ages, so that tte EE 


only a slight bearing upon the general problem of I.Q. constancy. 
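
The growth of the mean changes with lengthening interval can be checked by simple arithmetic on the figures of Table 18. The sketch below is only an illustration, not a computation reported by Bayley. It assumes that the "amount" columns are mean changes in I.Q. points, counts children showing no change as zero, and omits the 72-84 month comparison because its "no change" count is not legible in this copy. The names TABLE_18, mean_absolute_change, and mean_net_change are introduced here for the example.

```python
# Rows of Table 18 as read from this copy (assumed to be mean I.Q. points):
# (ages in months, n increases, mean increase, n decreases, mean decrease, n no change)
TABLE_18 = [
    ("84-96", 22, 9.86, 22, 5.59, 0),
    ("96-108", 25, 10.24, 17, 4.53, 3),
    ("72-96", 29, 9.09, 17, 8.36, 0),
    ("84-108", 28, 14.45, 16, 6.27, 0),
    ("72-108", 32, 12.64, 12, 8.70, 0),
]


def mean_absolute_change(n_up, up, n_down, down, n_same):
    """Average size of the I.Q. change over all children, ignoring direction."""
    total = n_up + n_down + n_same
    return (n_up * up + n_down * down) / total


def mean_net_change(n_up, up, n_down, down, n_same):
    """Average signed I.Q. change over all children (gains minus losses)."""
    total = n_up + n_down + n_same
    return (n_up * up - n_down * down) / total


if __name__ == "__main__":
    for ages, *row in TABLE_18:
        print(
            f"{ages:>7}  absolute {mean_absolute_change(*row):5.2f}"
            f"  net {mean_net_change(*row):+5.2f}"
        )
```

Under these assumptions the one-year comparisons give an average shift of roughly 7 to 8 points, while the two- and three-year comparisons give roughly 9 to 12 points, which is the trend the text describes.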


TABLE 18


AVERAGE CHANGES IN I.Q.'S OF A GROUP OF YOUNG CHILDREN OVER
CERTAIN PERIODS


(Bayley, 1940 b, after Table 3, p. 19)


AGES (IN MONTHS)       INCREASES           DECREASES       NO CHANGE
BETWEEN WHICH                 Amount              Amount
I.Q.'S COMPARED      Number   change     Number   change     Number

 72- 84                21      6.72        23      5.06       [...]
 84- 96                22      9.86        22      5.59         0
 96-108                25     10.24        17      4.53         3
 72- 96                29      9.09        17      8.36         0
 84-108                28     14.45        16      6.27         0
 72-108                32     12.64        12      8.70         0


Honzik (1938) again made a study of 252 children, using the
California Preschool Scale for ages from 21 months to 5 years, and
the Stanford-Binet [...] 8 years. She found that a California
Preschool Scale I.Q. obtained at about 21 months is very unstable,
and does not afford a reliable index of scores on the
Stanford-Binet scale later on. However, she also reports that
by the time the age of 3 years has been reached, mental test
performance seems to have become decidedly more stable, and
that significant predictions can be made. Some of the data she
reports are given in Table 19.




TABLE 19


CORRELATIONS OF TEST PERFORMANCE AT VARIOUS AGES WITH INITIAL
STATUS AT 1-9 YEARS AND FINAL STATUS AT 7-0 YEARS


(Honzik, 1938, from Table 5, p. 295)


Age in years       Initial Status    Final Status
and months         1-9 years         7-0 years

[...]                  .68               .46
[...]                  .59               .38
3-0                    .47               .56
[...]                  .50               .63
[...]                  .46               .66
[...]                  .32              [...]
[...]                  .30               .81
[...]                  .42


Bayley in several places points out (1933, 1939, 1940) that no
uniform over-all global score which can be derived from our
present tests has a high predictive value over the period of early
development. She found a so-called “developmental score,” which
was a composite of performance on mental and motor subtests, no
better as a prognostic index than a mental score alone. Also, she
reports that vocabulary tests at 6 to 9 years are only moderately
related to language tests at 3½ years, and not at all to the age of
first talking and to very early mental test scores. Form board and
puzzle board performance at 5½ is related to scores on general
mental tests and vocabulary tests at the same ages, but not to
tests of general ability at the age of one year. She argues, and
very soundly, that intelligence quotients reported at very early
ages are deluding. This may very well be the case, for the uniform
relationship between mental and chronological age upon which
the whole meaning of the intelligence quotient entirely depends
has not been established in the standardization of tests for these
early ages. The persistent use of the I.Q. as an index of mentality
in early childhood is yet
another instance of falsifications due to the popularity of the 
measure. Bayley’s explanation of these summarized findings is 
that the whole mental make-up changes with development during 
the early years of life. Quite probably it does. But one must also
remember that little children are difficult test subjects, and that
in work with them all sorts of [...] So, too, somewhat low
correlations [...] the various infant tests themselves, and between
these tests and the Stanford-Binet Scale.

With regard to th[...] Merrill[...] respectively, correlated .37
and .4[...] [...]tion ratings at 6 months. The suggesti[...] has
indicated that it measures the two factors of alertness and motor
ability (Nelson and Richards, Richards and Nelson).

[...] Children, on the other [...]

That all development proceeds very rapidly in early life seems
well established. And the changes may be qualitative as well as
quantitative. Both Stoddard (1943) and Bayley (1933 a) have
suggested that the organization of the mind may alter in itself in
the first two, three, or four years of a person's life.

D. It may be due to the cumulative effects of the environment. 
That environmental influences are peculiarly potent during early 
childhood is by no means an unreasonable proposition. If it is 
true, then test performance would clearly change, and the change 
would become greater with the lapse of time. As several investi- 
gators have remarked, among them J. E. Anderson (1939), chrono- 
logical age is by no means the uniform factor it is ordinarily 
assumed to be. Two children with a chronological age of three 
may have spent those three years very differently, and the dif- 
ference may affect their whole mentality. This, above all, is the 
point about which controversy is now centered. 

4. Atkins (q.v.) has presented an excellent summary of the
criteria by which a test intended for young children should be 
judged. (a) Its material should be intrinsically interesting, as one 
way of avoiding negativism and indifference. (b) It should require 
a minimum of oral directions, again to avoid as far as possible 
the effects of shyness and poor rapport. (c) It should demand 
only a brief span of attention for each item, since little children 
are highly distractible. (d) The materials should be as simple as
possible. (e) So far as possible, the test content should be based
on and selected in terms of equality of previous experience, 
although this cannot be fully attained. (f) So far as possible the 
test items should be noncommunicable, so that the child’s mother 
or older guardian can be in the room while they are run, without
risk of suggestions and expressions of approval, disapproval, and 
so forth. This again is for the sake of rapport, and to avoid nega- 
tivism and shyness. (g) Credit should be given for each actual 
response, not for two out of three, or only for all ten, etc. (h) 
Conditions for administration, and also the scoring instructions, 
should be as objective as possible. Many such tests place alto- 
gether too much reliance on the judgment of the examiner in 
both respects. (i) The standardization should be adequate. (j) The 
test should be set up for complete presentation of relevant data 
to make research possible. 

Virtually all tests for young children are open to criticism on
one or more of these criteria.

5. To summarize, there is no doubt that for one reason or
another, and probably for several reasons, retest correlations, 
intercorrelations at identical ages, and correlations extending over 
considerable periods of time are lower with young children than 
with older children. Yet it is by no means justifiable to claim 
that test results with young children have no prognostic value, as 
some would seem to suggest, and that general conclusions based 
upon them are to be disregarded. An examination of the data 
presented in Tables 17, 18, and 19 bears this out. In Table 17 it is
shown that the results of very early testing have little relationship 
to test scores earned some years later. The same appears from 
Table 19. But when the three-year-old level is reached, the cor- 
relations begin to assume indubitable significance. Thus in Table 
19 it appears that test scores at age 3 correlate .56 with those at 
age 7, and many comparable or higher coefficients are to be found 
in these data. If this is not a finding peculiar to these particular 
investigations, but indicates approximately the true relationship, 
it then compares quite favorably to the relationship between 
mental test scores obtained in high school, including senior year, 
and achievement in college, which is surely about as much as one 
could reasonably expect. Indeed, as one surveys these figures as 
a whole, it is difficult to find in them anything disastrous to the 
status and general prognostic value of tests for young children. 
Of course, such tests must be used and their results must be 
interpreted with special care—although such a reservation can 
very well be made with respect to all psychometric results. No 
doubt, too, as J. E. Anderson (1939) and others point out, the 
younger the subject, the greater the risk of error. But on the face 
of the evidence, persons who do not like certain conclusions, spe- 
cifically regarding the influence of the environment, which have 
been drawn from the testing of young children, must rebut them
in some more convincing way than by a wholesale attack upon
the scales themselves. Unless, indeed, they are prepared to reject
every kind of psychometric evidence.


TESTING ADULT INTELLIGENCE


Turning now to the other extreme of the age range, namely
adult mentality, still further practical and theoretical
psychometric problems and issues of the first importance are
involved. The situation can be summed up briefly and simply. Very
few existing instruments of mental measurement are well suited to
deal with adults. When so used, the tests that are available are
apt to lead to very misleading conclusions. It is both important
and revealing to analyze the reasons for this striking limitation,
and to ask how and whether it can be overcome.


1. Reasons for the unsuitability of existing tests 


Most, although not all, existing tests are not well suited for the
measurement of adult intelligence for the following reasons.

A. Tests have been prevailingly standardized on persons in
school. This is partly, though not entirely by any means, a matter
of sheer convenience. It is much easier to secure standardization 
groups of adequate size from school populations than from among 
independent adults. One thing that made the standardization of 
the Army tests feasible was that the subjects were under orders, 
and so could be made available as needed. But to do this with 
adults from the general population, and to secure adequate num- 
bers willing to submit to the sometimes rather lengthy processes 
necessary, is quite another matter. So school groups have been
very largely drawn upon. This involves several limitations. (a) It 
means that standardization is run on persons within a limited 
range of age. (b) It means that whereas the norms may represent 
an unbiased sample of the school population, they almost certainly 
involve a certain bias with respect to the general population. The 
mentality of pupils in school is probably different in various 
undefined but not unimportant respects from that of independent 
unselected adults. (c) Perhaps most important of all, such stand- 
ardization groups very easily involve a factor of selectivity, par- 
ticularly if they are chosen from the upper grade levels. Thus, 
when the norms so obtained are applied to the general population, 
the results are often disconcerting and even fantastic. The out- 
standing instance of this is the finding that the average mental 
age of the population of the United States is approximately 13 
years, which came from the first Army testing program. 

B. But the use of school groups for purposes of test construc- 
tion involves something far more than mere external convenience. 
A pragmatic working conception of intelligence pervades the entire 
milieu, and operates as a kind of implicit major premise which 
influences test construction in all its aspects. That conception is 
by no means invalid or erroneous, but it is limited and special. 
And its effect is constantly present. It provides ready-made, easily
available criteria for validation, which are easily expressed in
numerical terms suitable for statistical treatment. These are the 
various measures of school progress and school achievement which 
can, if needed, be supplemented by teacher ratings. They are the 
criteria very widely used for item selection on the basis of power 
to discriminate the bright from the dull, and for the over-all 
validation of finished tests. There is a general but none too explicit 
agreement as to what validity actually means in practice, for the 
reason that the test is constructed in and for a milieu where an
institutional conception of intelligence is operating. A valid test
is one that agrees with this conception, which, to repeat, probably
has authenticity and general significance within limits but is cer- 
tainly specialized to an undefined degree. We know what intelli- 
gence means in terms of school life and experience. It means 
school success. This is by no means unrelated to success in life in 
general, and particularly to the intellectual tasks of life. But
[...] manageable institutional [...] from time to time. One [...]
residence in homes for [...] though less in extent, has been done
in connection with the testing program [...] lawyer, engineer,
public relations man, reporter, chief clerk, teacher, dra[...] are
teamster, miner, lumberjack, barber, laborer, truck driver.

There does seem to be s[...] intelligence, although it is not a
ve[...]ing criterion for test construction would be impossible.
Probably it is true that for many occupations a certain minimal
intelligence is necessary. But the reason why persons of superior
mental endowment are not usually found in unfavored vocations is
not that they could not succeed in them, but that they do not like
them. So occupational-intelligence ranking is not at all simple or
unequivocal, and to select test items because, for example, they
discriminated between bookkeepers and station agents would be
fantastic.


TABLE 20


INTELLIGENCE RATINGS IN VARIOUS OCCUPATIONS
(Selected and adapted from Fryer, 1922)


Intelligence Group A           Intelligence Group C
  Engineer                       Locomotive Engineer
  Clergyman                      Policeman
  Accountant                     Toolmaker
                                 Actor
Intelligence Group B             [...]
  Physician                      Painter
  Teacher
  Accountant                   Intelligence Group C-
  Dentist                        Hospital Attendant
                                 Shoemaker
Intelligence Group C+            [...]
  Bookkeeper                     Textile Worker
  Photographer
  Railroad Conductor           Intelligence Group D
  Electrician                    Fisherman
  Druggist


The truth is that in order to be of service in validation, our
operating conception must express itself not merely in a
verbal formula but in operating and tangible form. In practice this
means that it must express itself in some kind of institutional
milieu. Such a milieu is provided in an almost providential manner
by the school. But for the general adult population no such simple
practicable criterion is at hand.

C. One of the assumptions underlying much test construction
is the existence of a determinate relationship between chrono- 
logical age and mentality. Such procedures as bringing the mean 
test performance and the mean chronological age of age groups 
into equivalence, locating subtests at the point where they are 
passed by 50% of the age group, locating them “at par,” selecting 
test items because they show increasing scores with advancing 
age levels, and so forth, are simply ways and means of putting this 
hypothesis to practical use. If it were not true, they would all be- 
come meaningless statistical manipulations. There is every reason 
to believe that the assumption, with certain qualifications no 
doubt, is substantially true with young people. But when it comes 
to mature human beings, a very serious doubt arises. This is why 
on both the Stanford Revisions of the Binet scale all mental ages 
for the upper levels are derivative. Certain questions and
objections can be and in fact are raised concerning mental age
determinations even with the young. But at least this kind of
statistical interpretation of test performance has a ponderable
reason back
of it here. But when adults are concerned, the support for this 
practice becomes much more shaky and uncertain. This, of course, 
is why Wechsler, in constructing a test designed to deal with adult 
intelligence, abandoned the whole concept of mental age. One 
cannot but feel some regret that he still retained the intelligence 
quotient, at least in name. To argue in favor of these measures, 
as Terman and McNemar have done, that laymen find them easy 
to understand, is not at all convincing. Perhaps what laymen 
really find easy is to misunderstand them. 
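
The difficulty can be made concrete with the conventional ratio definition of the intelligence quotient, I.Q. = 100 × M.A. / C.A. The sketch below is illustrative only: the assumption that measured mental age levels off at 15 years, and the chronological ages chosen, are hypothetical figures for the example and are not taken from any of the studies cited. The names ratio_iq and PLATEAU are introduced here.

```python
def ratio_iq(mental_age_years, chronological_age_years):
    """Conventional ratio I.Q.: 100 * mental age / chronological age."""
    return 100.0 * mental_age_years / chronological_age_years


# Hypothetical assumption: measured mental age keeps pace with
# chronological age up to 15 years and does not rise thereafter.
PLATEAU = 15.0


def illustrative_iq(chronological_age_years):
    """Ratio I.Q. of a hypothetical average person under the plateau assumption."""
    mental_age = min(float(chronological_age_years), PLATEAU)
    return ratio_iq(mental_age, chronological_age_years)


if __name__ == "__main__":
    for ca in (8, 12, 15, 20, 30, 45):
        print(ca, round(illustrative_iq(ca), 1))
```

If the ratio were applied literally, the same unchanging performance would yield a quotient of 100 at age 15, 75 at age 20, and 50 at age 30. This is one way of seeing why mental ages at the upper levels have to be treated as derivative, and why the concept is hard to defend for adults.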

D. Mature persons are likely to become considerably, and 
indeed sometimes extremely, specialized in their interests and 
their patterns of mental activity. An automobile mechanic, for 
instance, may be highly ingenious, adaptive, and resourceful in his 
own field of endeavor, but quite the reverse if he has to deal with 
problems of salesmanship or administration or finance.
Presumably one has to believe that intelligence possesses certain
universal characteristics. But to deal with them for the sake of
measurement, or for any other purpose, they must be approached
through their special manifestations. This is much more difficult
with the highly specialized adult than with the relatively
unspecialized child. It follows, therefore, that the sort of over-all
average rating which is the true psychometric meaning of general
intelligence in the Binet tradition (and Binet also worked with
children) is much more likely to approximate the true mentality
of a child than of
an adult. 

Along the same line, too, it should be pointed out that the
backgrounds of a group of children are almost certain to exhibit more
uniformity than those of a group of adults, for the very simple 
reason that cumulative differences have had less time to build up. 

So, while it is presumably true that intelligence has the same 
universal characteristics in children and adults, its manifestations 
in the latter become more complex, diverse, and specialized. This 
is one of the chief reasons why it is harder to deal with adult 
intelligence by psychometric techniques. 

E. Finally, conventional test material is not well suited to 
adults. This is almost inevitable, because the great pool of test 
items which has been accumulating through the years has come 
in the main from tests designed for, tried out with, and adminis- 
tered to children and young people. The kind of test items com- 
monly used often strikes adults as silly, trivial, merely manipula- 
tive, concerned only with word juggling, requiring nothing but 
information, etc., etc. Such objections apply both to verbal and 
performance material. Above all, there is a strong tendency in
many tests to emphasize speed, whereas the desire of many adults 
is to ponder before deciding. When one recalls the strong and most 
emphatically legitimate insistence of Stoddard and Boynton that 
intelligent behavior cannot be dissociated from attitude, motiva- 
tion, and a sense of the significance of the task, the seriously 
disturbing effect of such material becomes abundantly evident. 


2. Evaluation for adult use of tests already discussed 


With these problems in mind, it is desirable to evaluate well- 
known general tests from the standpoint of their suitability for 
adult use. This can be done quite briefly, since representative 
instances have already received a more extended general
discussion.

A. Army Group Intelligence Examination Alpha. This was con- 
structed and used as a test for adults. It was aimed at a group 
with a very special background, but specialized items can readily
be edited out, as the various revisions have shown. It has,
however, many limitations for general adult use. The speed factor
is unduly prominent. It is a purely verbal test. The item content
often strikes adults as trivial. But above all, it is so easy that it
cannot reveal the mentality of highly endowed adults. For adult
groups of low to mediocre intelligence, one or other of the
revisions of the original Alpha may be serviceable instruments,
though even so the factor of adult specialization readily vitiates 
the ratings. One may say that even with adult groups who will 
not reach its ceiling it should be used in conjunction with other 
tests designed to reveal special aptitudes. 

B. Otis Quick-Scoring Tests of Mental Ability. Much the same
considerations apply here, except that the test does not contain
such specialized items as Army Alpha. But it does emphasize
speed of response. It is too easy for a considerable percentage of
the adult population. And it has the limitations of any general test
when so used.


C. Thorndike Intelligence [...] purpose, and with
qualifications [...]

D. American Council on Education [...] Moreover, [...] example
of a comprehensive sam[...]

[...] accepted as in many respects the most satisfactory all-round
adult intelligence test. It was standardized on adult groups as
well as adolescents. Performance on the scale is not interpreted
in mental ages. It extends upwards to advanced age levels. It
yields not only a global measure, but also scores based separately
on performance subtests and verbal subtests, both of which it
contains. And as Rapaport and others have shown, it has
considerable diagnostic efficiency, which could probably be
increased if further attention were given to the profiles it yields.
Of course, however, it has limitations. It is an individual test,
which implies various practical drawbacks. It has no decisive
external validation, although increasing experience with its use
supports it. And it is by no means so suitable as some other
instruments for special purposes and special groups, particularly
for the prediction of academic success.

TESTS DESIGNED SPECIFICALLY FOR ADULTS


1. Wonderlic Personnel Test * 


This is an adaptation specifically for adult use of the Otis Self- 
Administering Test of Mental Ability, Higher Form. The latter 
is regarded as too easy for general adult use, and accordingly the
difficulty level is raised and a more even order of difficulty
provided. The test is greatly abbreviated, and can be run in 12
minutes as compared to 30 minutes for the Otis. In content, it is a
scrambled omnibus arrangement of 50 items. The title is chosen
to avoid what the author believes to be the alarm engendered in
[...] an intelligence test. The self-administering feature is
retained.

Because of the specialized nature of the test, norms are not
computed for an unselected population. Instead, they are
developed for representative industrial and business groups, to
wit, outside representatives, clerks, managers of local offices of
the Household Finance Corporation, vacuum cleaner salesmen,
typists, Hollerith key operators. Also there are norms for
educational and sex groups. Scores tend to fall with age, and a
formula for compensation is provided. It is found to have about
the same reliability as the Otis, with which it correlates from .81
to .87. Validation data based on occupational success indicate
that, of managers who score 25 or less, 78% fail. The mean for
successful employees is 29.7, and for unsuccessful 25.

* Reference: Wonderlic and Howland.


2. Army General Classification Test *


This is one of the basic intelligence tests developed in World 
War II. A synoptic outline is presented in Figure 21. 


Part 1, Sentence Completion 
Incomplete sentences followed by five completing terms, one to be 
chosen. 

Part 2, Opposites 
Series of terms followed by five other terms, the one most nearly 
opposite to the first to be indicated. 

Part 3, Analogies 
Series of incomplete analogies followed by five terms, the one 
completing the analogy to be indicated. 


FIG. 21. ARMY GENERAL CLASSIFICATION TEST.
SYNOPTIC OUTLINE


TABLE 21

RATINGS ON ARMY GENERAL CLASSIFICATION TEST

(Staff, Personnel Research Section, Adjutant General's Office, 1947, p. 393)


                  ARMY STANDARD SCORE RANGES
Army Grade    Through June 1942    From July 1942

I             130 and higher       130 and higher
II            110-129              110-129
III           90-109               90-109
IV            70-89                60-89
V             69 and lower         59 and lower


* Reference: Staff, Personnel Research Section, Adjutant General's Office, 1947.

The test is unsuited to those with less than a fourth grade
schooling. Standard scores of less than 50 should be disregarded
as probably due to illiteracy. Low correlations with rank in service have
been reported (Duncan), and correlations around .55 with amount 
of education have been found. The test has been used for pre- 
liminary classification for various military occupations, and test 
ratings appropriate for many of them have been published 
(Harrell). 
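
The conversion from standard score to Army grade in Table 21 amounts to a simple lookup, sketched below for illustration. The grade bands are taken to be contiguous, so that Grade IV through June 1942 is read as 70-89; the function and constant names are introduced here and are not part of any Army procedure. The helper is_interpretable merely encodes the caution that standard scores below 50 should be disregarded.

```python
# Each band is (grade, lowest standard score in the band), highest band first.
# Scores below the Grade IV floor fall into Grade V.
BANDS_THROUGH_JUNE_1942 = [("I", 130), ("II", 110), ("III", 90), ("IV", 70)]
BANDS_FROM_JULY_1942 = [("I", 130), ("II", 110), ("III", 90), ("IV", 60)]


def army_grade(standard_score, bands=BANDS_FROM_JULY_1942):
    """Return the Army grade (I to V) for an Army standard score."""
    for grade, floor in bands:
        if standard_score >= floor:
            return grade
    return "V"


def is_interpretable(standard_score):
    """Scores below 50 are to be disregarded as probably due to illiteracy."""
    return standard_score >= 50


if __name__ == "__main__":
    for score in (135, 110, 95, 65, 45):
        print(
            score,
            army_grade(score, BANDS_THROUGH_JUNE_1942),
            army_grade(score),
            is_interpretable(score),
        )
```

A standard score of 65, for example, falls in Grade V under the earlier ranges but in Grade IV under those in force from July 1942.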


3. Army Individual Test of General Mental Ability 


The new Army Individual Test of General Mental Ability is 
also of interest, which is heightened by a comparison with Army 
Alpha of nearly thirty years ago. The new test is an individual 
instrument requiring 40 minutes’ time. It consists of three verbal
subtests (story memory, similarities-differences, vocabulary) and
three nonverbal subtests (trail-making, cube assembly, shoulder
patches). Quite possibly it may come into wide use after suitable
revision, for it has considerable specific military content and bias 
(v. Staff, Personnel Research Section, Classification and Replace- 
ment Branch, Adjutant General's Office, 1944). 


4. United States Armed Forces Institute Tests of General 
Educational Development 


These might be considered educational tests, but their scope is 
so broad and their emphasis upon mental processes so definite
that they can well be classified as tests of general intelligence.
Their purpose is general educational and vocational guidance, the
educational placement of returning service men, and also the
determination of the educational status of those not intending to
continue their schooling. They deal with the interpretation of
reading materials in the social studies and in the natural sciences,
the interpretation of literary materials, and also correctness and
effectiveness of expression as shown in the making of corrections
and improvements in printed passages originally well written but
deliberately corrupted. There is also a test of general
mathematical ability for high school level only which calls for the
solving of various practical problems, and more specialized
mathematical tests for college level.

As may be gathered, the tests are not directed towards specific 
content, but towards the power to interpret and evaluate written
material. This naturally calls for a background of substantial
knowledge, but the emphasis is upon generalized intellectual
abilities. There are two batteries, one for high school level, the
other for college level. There are two equivalent forms for each
battery, one military, the other civilian. Norms for each test are
developed on student populations enrolled in appropriate courses.
For instance, the test in correctness of expression was
standardized on students in freshman English; the test in social
studies on students in survey courses in the field, among others.
Norms are presented for three types of institutions classified on
mean freshman scores for 1941 on the American Council on
Education Psychological Examination. The three types are
institutions with mean scores of over 113, those with scores from
113 to 95, and those below 95. The tests are work-limit, not
time-limit tests, as they are intended to measure power. Actually
120 minutes is ample time for most individuals on the college
tests, and 95 minutes on the high school test.


One notable technical development in psychometrics during
World War II has been the wide use of very brief tests for
“screening” purposes, i.e., for quick classification and disposal
(v. Hunt and Stevenson; Hunt, Wittson, and Harris). Such tests
include abbreviations of the Wechsler-Bellevue scale, such as
those by Rabin (q.v.), Geil (q.v.) and Gurvitz (q.v.), and also
many [...] The same tendency in connection with civilian [...]
noted, particularly with reference to tests by Otis, and [...]
reference to the W[...] a great impetus in the armed services. The
fact that such [...] common objection [...] valid a test must be
lengthy.


5. Desirable characteristics of adult tests


Stoddard (1943), on the basis [...] practical experience, has set
up the following requirements for any adequate program or
battery for the measurement of adult intelligence:


“A. General tests of adult intelligence
    (1) Tests of general comprehension
    (2) A logic test
    (3) A test of plasticity (learning and retention of new
        material)
    (4) A test of concepts of personality and behavior
    (5) A test of concepts of social responsibility

B. Special tests of adult intelligence (illustrative items)
    (1) Advanced reading and comprehension
    (2) Concepts in mathematics and the physical sciences
    (3) Concepts in the social sciences
    (4) Concepts in the fine arts
    (5) Concepts in the humanities
    (6) Concepts in applied arts, crafts, and vocations”

(p. 157)


Under heading B, the subject would be given a choice, except 
for the first item—advanced reading and comprehension. Stoddard 
also argues that such a battery should contain two further ele- 
ments. The first would be an opportunity to produce original 
solutions and constructions, not to be scored by a key but to be 
rated by a committee of judges. The second would be a test
revealing power “to resist the strongest forces of suggestion and
irrationality available within the practical limitations of testing”
(pp. 155-6). Such a battery might yield a single global score, but
would certainly yield profiles based on indices of the various
attributes or manifestations of intelligence. This conception of an
adequate scheme for the measurement of adult intelligence is the
clear consequence of Stoddard’s description of intelligence, which
has already been discussed in this book. As the reader will
remember, he thinks of intelligence as the ability to undertake
activities characterized by difficulty, complexity, abstractness,
economy, adaptiveness to a goal, social value, the emergence of
originals, persistence, and resistance to misleading distractions.

This recommendation raises once again a crucial issue to which 
attention has already been called several times. It is the
desirability or undesirability of expressing the results of mental
measurement in a global or over-all score, i.e., a mental age, or an
intelligence quotient, or a percentile or standard score, or what
not. The practice is being very stringently questioned in current
psychometric discussions. As has been seen, there is already a
tendency towards the construction of tests that do not yield such
global scores; or if they do, also offer profile scoring as a
substitute or alternative. Instances are the latest editions of the
American Council on Education Psychological Examination for
College Freshmen, the California Test of Mental Maturity, and
above all the Chicago Tests of Primary Mental Abilities.

Global scores which consolidate in a single measurement per- 
formance on a variety of items have proved reasonably satisfac- 
tory in tests for school purposes. But, as has been argued here, 
this is because an effective institutional definition and criterion 
of general intelligence is present. That this is indeed the reason 
is suggested by the fact that such tests have proved far less satis- 
factory and convincing outside the educational milieu. Thelma 
Hunt (q.v.) summarizes a number of studies on the use of intelli- 
gence tests for vocational purposes. Of 195 business concerns re- 
plying to a questionnaire, only 17% used such instruments in their 
employment procedures. Of 36 states replying, 11 had some cen- 
tralized personnel agency, and of these 7 used intelligence tests. A 
sampling of reports on the relationship between intelligence test 
performance and vocational success gives correlations of .22 for 
stenographers, .34 for reformatory officers, .50 for patrolmen, .31 
for firemen, .28 for bank examiners. If these are truly representa- 
tive figures, it is clear that existing intelligence tests do not meas- 
ure effectively what it takes to do well in a great many jobs. But 
they are designed to measure primarily intelligence as it manifests 
itself and leads to success in the job of going to school. For this 
their global ratings have proved reasonably satisfactory. 
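
One common way of reading coefficients of this size is to square them, 
which gives the proportion of variance in the criterion that the test 
score accounts for. The sketch below applies that convention to the 
figures quoted from Hunt; the squaring is an interpretive step added 
here, not something the studies themselves are said to report, and the 
function name is only illustrative. 

```python
# Validity coefficients quoted in the text
# (intelligence test score vs. vocational success).
validity = {
    "stenographers": 0.22,
    "reformatory officers": 0.34,
    "patrolmen": 0.50,
    "firemen": 0.31,
    "bank examiners": 0.28,
}


def variance_explained(r):
    """Proportion of criterion variance accounted for by a correlation r."""
    return r * r


for job, r in validity.items():
    share = variance_explained(r)
    print(f"{job:22s} r = {r:.2f}   r squared = {share:.2f}")
```

On this reading even the highest coefficient, .50 for patrolmen, leaves 
three quarters of the variation in vocational success unaccounted for. 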

The argument, then, is for profile or specialized scores and rat- 
ings. Several times already we have noticed this trend in connec- 
tion with tests that have been discussed. And it is, indeed, the 
outstanding feature of the new types of tests now to be con- 
sidered. 


EMERGING TYPES OF TESTS 


Thorndike (1928), in an appraisal of the testing movement up 
to that time, foresaw the development of psychometric instruments 
which would measure well-defined mental functions, and would 
center down exclusively on the functions as conceived. This is in 
contrast to tests of loosely defined general ability or general intel- 
ligence such as developed out of the work of Binet and out of that 
done in the United States Army program. There are three recent 
instances of just such tests now to be considered. 


1. a, b. Chicago Tests of Primary Mental Abilities * 


This exceedingly important battery appeared in two consider- 
ably different editions, so different, indeed, that they may almost 
be called different tests. The experimental edition was published 
in 1938, and the definitive edition in 1942. They represent a depar- 
ture in mental measurement, either the success or the failure of 
which will be momentous. 


* References: Thurstone, 1938; Thurstone and Thurstone, 1941. 

The work was based on Thurstone’s monumental research in 
factor analysis published in 1938. He gave 54 tests of various 
kinds to 240 college students, obtaining in all about 1,500 correla- 
tion coefficients. These were subjected to analysis to determine 
the underlying factor pattern which would explain the interrela- 
tions of test performance. He identified in this way a number of 
“primary mental abilities,” i.e., basic mental functions cutting 
across and involved in many different mental operations and types 
of test performance. Those which received specific designations 
were as follows. (a) P, i.e., perceptual ability. (b) N, i.e., numeri- 
cal ability. (c) V, i.e., verbal ability. (d) S, i.e., spatial visualizing 
ability. (e) M, i.e., memory. (f) I, i.e., inductive or generalizing 
ability. (g) D, i.e., deductive or reasoning ability. 
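
The figure of "about 1,500" coefficients follows from simple counting: 
every test is correlated once with every other test. A minimal check, 
in which the function name is only illustrative: 

```python
from math import comb


def pairwise_correlations(n_tests):
    """Number of distinct test pairs, i.e. n(n - 1) / 2."""
    return comb(n_tests, 2)


# Thurstone's 54 tests give 54 * 53 / 2 distinct pairs.
print(pairwise_correlations(54))
```

For 54 tests this comes to 1,431 distinct coefficients, which agrees 
with the round number given above. 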

The test battery built to measure these primary abilities con- 
sists of 16 subtests in three booklets. Instead of a total or global 
score, it yields a profile showing the distribution in the subject of 
the seven basic traits or abilities. Since it was experimental, and a 
definitive edition has now appeared, it will not be described in 
detail here. 

The second or definitive edition is shown in outline in Figure 
22. As will be seen, it consists of 11 subtests instead of 16 as in 
the experimental edition. The list of primary abilities embodied 
and revealed in the tests has also been modified. Perceptual ability 
and inductive ability do not appear. The designation W is new, 
and stands for word fluency, or the ability to think of and use 
words rapidly and copiously. The designation D is altered here 
to R, which stands for reasoning ability. This test, like the pre- 
vious one, yields a profile in terms of the designated basic mental 
abilities as defined. 

Since the battery has not been out long enough for definitive 
investigations to have appeared, such evaluation as can be made 
will concern the experimental tests. But it probably applies to 
the former as well. 

First of all, this clearly is a development in test construction 
which is of major importance. Basic concepts are defined with 
precision. The idea of a loose average sampling of general intelli- 
gence is given up, and with it goes the familiar global score or 
rating, whether it be of the nature of a mental age, a percentile, 
or a standard score. The hypothesis is that we are dealing with 
fundamental psychological components and processes which are 
to be found in many aspects of mental life and behavior. The 
individual is rated on the pattern or profile of these basic com- 
ponents which the test reveals in the make-up of his mind. Just 
how certain it is possible to be that this particular list contains 
such authentic component elements of mentality is a question 
that will be postponed until the technique of factor analysis by 
which it was obtained comes up for consideration. For the moment 
it is sufficient to remark that Thurstone obtained what he calls 
primary abilities from an analysis of test performance rather than 
from a direct experimental analysis of human behavior. This at 
least raises some doubt. 


1. ADDITION (N)* 
Sets of columns of numbers with total given, latter to be 
marked right or wrong. 

2. MULTIPLICATION (N)* 
Sets of multiplications with products given, same task as above. 

3. VOCABULARY (V)* 
Set of stimulus words with four words following, task to choose 
the word with the same meaning as the stimulus word. 

4. COMPLETION (V)* 
Problem statements calling each for one word in response. Five 
letters given, task being to indicate the initial of the correct 
response word. 

5. FIGURES (S)* 
Stimulus figures each followed by six figures one of which is 
the stimulus figure in a new position, task being to select the 
stimulus figure. 

6. CARDS (S)* 
Stimulus pictures of cards in various geometrical shapes, each 
followed by six others, one being stimulus figure in novel posi- 
tion, task being to identify it. 

7. FIRST LETTERS (W)* 
Writing down as many words as possible beginning with a given 
letter. 

8. FOUR-LETTER WORDS (W)* 
Writing down as many four-letter words as possible beginning 
with a given letter. 

9. LETTER SERIES (R)* 
Deciding what would be the proper next letter in various letter 
series. 

10. LETTER GROUPING (R)* 
Letter groups made on various principles, three groups in each 
series the same and one different, task being to identify the one 
different. 

11. FIRST NAMES (M)* 
Practice exercise on a series of first names connected with last 
names, then test choosing the right first name for given last 
name from seven alternatives. 

* The letters indicate the primary ability involved in the test. 


FIG. 22. CHICAGO TESTS OF PRIMARY MENTAL ABILITIES. 
SYNOPTIC OUTLINE 

As a practicable battery the instrument has been found some- 
what disappointing, in spite of defenses based both on theory and 
on its predictive efficiency (Crawford, 1940; Shanner). It is very 
long, though this would hardly matter if it proved of superior 
validity and worth. The tests are closely timed, and this puts a 
premium on speed. Yet speed of response is not recognized as one 
of the primary factors revealed. The profiles are difficult for even 
an expert to interpret, and in effect unintelligible to the layman 
(Stalnaker, 1939, 1940). Thurstone’s position, in the past at any 
rate, has been that the primary abilities are independent of one 
another, yet the following correlations have been obtained: per- 
ceptual ability with number ability, .50; perceptual ability with 
verbal ability, .61; perceptual ability with spatial visualization, .49 
(Crawford, 1940). They certainly do not suggest independence 
among the variables. 

As to effective discriminative and prognostic value, in this it 
has not so far been proved superior to many more familiar well- 
constructed tests. Goodman (1944), summarizing the work 
up to that date, concludes that it predicts college success as 
well as most other intelligence tests, though it takes longer than many. 
One of the more favorable studies is that of Ellison and Edgerton 
(q.v.), who report the highest subtest correlations with point-hour 
ratio for college students as .44 for the verbal factor, the next as 
.31 for the memory factor, and the multiple correlation for all 
factors as .64. Or again, the data in Table 22 do not demonstrate 
superiority. In addition to this material, Bernreuter and Good- 
man (q.v.) find a multiple correlation of .49 between the battery 
and semester marks for 170 freshmen engineering students, and 
subtest correlations running from .04 to .38. 
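
How a multiple correlation can exceed every single subtest correlation 
may be seen from the standard two-predictor formula, R squared = 
(r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2). The sketch below applies it 
to the two highest coefficients quoted from Ellison and Edgerton. The 
correlation between the two subtests, r12, is not given in the text, so 
the values tried here are purely hypothetical, and the result is not 
comparable with the reported .64, which rests on all the factors. 

```python
from math import sqrt


def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2 : correlation of each predictor with the criterion
    r12    : correlation between the two predictors
    """
    if abs(r12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)


R_VERBAL = 0.44   # verbal factor vs. point-hour ratio (from the text)
R_MEMORY = 0.31   # memory factor vs. point-hour ratio (from the text)

# Hypothetical intercorrelations between the two subtests.
for r12 in (0.0, 0.3, 0.6):
    combined = multiple_r(R_VERBAL, R_MEMORY, r12)
    print(f"assumed r12 = {r12:.1f}   multiple R = {combined:.2f}")
```

The more the two subtests overlap with one another, the less the second 
adds to the first, which is one reason why intercorrelated "primary" 
abilities limit the gain to be had from a profile. 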


TABLE 22 

CORRELATIONS BETWEEN PRIMARY MENTAL ABILITIES AND EDUCATIONAL 
ACHIEVEMENT 

(Bernreuter and Goodman) 

                          CORRELATION WITH ACHIEVEMENT IN VARIOUS FIELDS 

                          Semester                        English     Mathe- 
ABILITY                   Average   Chemistry  Drawing   Composition  matics 

P. Perceptual ability       .04       .07        .00        .05        .04 
N. Number ability           .32       .27       -.01        .26     [illegible] 
V. Verbal ability           .33       .32        .01        .44        .16 
S. Spatial ability          .23       .19        .11    [illegible] [illegible] 
M. Memory                   .10       .04    [illegible] [illegible]  -.05 
I. Induction                .34       .23        .15    [illegible]    .29 
D. Deduction                .38       .41        .15        .21        .44 
N V S I D (together)    [illegible] 
N V I D (together)          .49 
N V M I D (together)        .49 
N S I D (together)          .49 


2. SRA Primary Mental Abilities (Primary) * 


This battery, published in 1946, and intended for children of 
ages from 5 to 7 years, contains tests for five primary mental 
abilities, namely, motor ability, perceptual speed, verbal meaning, 
space, and quantitative thinking. This involves six of the eight 
primary abilities so far clearly defined in Thurstone’s work, since 
numerical ability and reasoning combine to manifest themselves 
in quantitative ability in children of this age. The test items are 
presented in graphic form. Thus verbal ability is measured by 
tests in which the child indicates his understanding of word mean- 
ings, sentence meanings, and paragraph meanings by indicating 
appropriate pictures. Spatial ability is measured by items requiring 
the child to select from a series of diagrams the one which will 
complete an incomplete square, and by items requiring him to copy 
in the element or elements omitted from an incomplete design from 
the complete design which is presented. Motor ability, which is of 
much importance at this age, is measured by drawing lines con- 
necting dots in parallel rows. The tests are designed for group 
administration, the elapsed time required being about one hour. 


* References: Thurstone and Thurstone, 1948; T. G. Thurstone. 

The primary battery is one of a series of similar batteries now in 
preparation. The battery yields a total score, which is said to give 
a measure of the child’s general learning ability, and also scores 
on each of the five abilities involved. These scores can be con- 
verted into mental ages and into quotients, and the abilities scores 
yield a profile. It is believed that the profile has much more 
significance both for diagnosis and guidance than the total score. 
Thus the verbal-meaning and perception scores are regarded as 
closely related to reading readiness, and the manual presents a 
general discussion showing the advantages of profile ratings over 
global scores. 


3. California Test of Mental Maturity * 


This important battery is yet another instance of a test built 
about sharply defined and delimited concepts. It has gone through 
several revisions since its first publication in 1937. It is set up for 
use at five levels, namely, preprimary, primary, elementary, inter- 
mediate, and advanced. An important feature of the battery is 
that it provides pretests for visual acuity, auditory acuity, and 
motor coordination. A synoptic outline of the mental maturity 
tests themselves, omitting the three pretests, is presented in 
Figure 23. 

The factors about which the battery is built in its present form 
are immediate and delayed memory, spatial relationships, logical 
reasoning, numerical reasoning, and vocabulary. The authors have 
proceeded on the multi-factor assumption, i.e., that the significant 
constituents of mentality consist of more or less separate primary 
abilities, rather than of a general factor together with group and 
special factors. Also they believe that global measures conceal 
much about a subject’s mentality that is of great importance. 
The battery yields profiles based on the indicated factors. Also 
it yields three kinds of M.A.’s and three kinds of I.Q.’s, namely, a 
language M.A. and I.Q., a nonlanguage M.A. and I.Q., and a 
total M.A. and I.Q. (Traxler, 1937, 1939; Tiegs). 


* References: Maxfield, 1937; Tiegs; Traxler, 1937; Traxler, 1939. 


Test 4. MEMORY (immediate recall) 
A series of word pairs given vocally, with a set of 3 pictured objects 
corresponding to each word pair. Task is to identify object cor- 
responding to the first word in the pair. 


Test 5. MEMORY (delayed recall) 
A story or expository passage read aloud to subjects. This reading 
comes at a late point in the battery, and after it other tests are run. 
After the interval involved, multiple choice items on the material 
are given. 

Tests 6, 7, 8. SPATIAL RELATIONSHIPS 
Tasks involving discrimination between right and left, the manipu- 
lation and transposition of geometrical forms, maze problems, etc. 


Tests 9, 10, 11, 15. LOGICAL REASONING 
Tests 9, 10, and 11 are nonlanguage reasoning tests, presenting 
tasks in graphic form. Test 15 is a verbal test of logical reasoning, 
requiring the drawing of formal inferences, etc. 


Tests 12, 13, 14. NUMERICAL REASONING 
A variety of reasoning problems involving the use of numbers. 


Test 16. VOCABULARY 
Fifty 4-choice items calling for the interpretation of words. 


FIG. 23. CALIFORNIA TEST OF MENTAL MATURITY. 
SYNOPTIC OUTLINE 


Reported reliabilities f 
eighties. Educational, bu 
are indicated in the mal 
as great. Kuhlmann (1939-40) has expressed doubts as to the value 
of labeling tests by the functions they measure or are supposed to 
measure, because no one can tell what those functions are by in- 
spection. Moreover, the practice can easily lead to absurdities that 
have their risky side; as, for instance, if one concludes that a 
child is a “good reasoner” with a “poor memory.” 


APPRAISAL OF INTELLIGENCE TESTS 


Having concluded our survey of intelligence tests, it is appro- 
priate to consider the question of appraisal and of standards and 
techniques of evaluation. 


1. Significant Trends in Intelligence Testing 


What are the chief lines of development that have manifested 
themselves in intelligence testing, since the inception of individual 
testing in the work of Binet, and the formative work done in 
group testing in World War I? 

The advance in efficiency is undeniable. Greater ease in adminis- 
tration has been achieved, notably by the adoption of the spiral 
omnibus type of organization in the place of separate subtests. 
This makes it possible to mass instructions at the beginning of 
the test, together with some necessary practice, and does away 
with the practically difficult problem of fine timing, which also 
raises theoretical issues. The spiral omnibus organization has been 
both defended and attacked. The argument in favor of it is that 
it requires the subject to make frequent and rapid adjustments, 
and that this is considered one of the important aspects of intelli- 
gence. But this is probably farfetched. The argument against it 
is that it makes for boredom and reduces motivation, which is 
stepped up by a sequence of brief subtests effectively introduced. 
There is no significant evidence either way, and the truth prob- 
ably is that the specific psychological effect of the device is 
negligible. 

As another factor of efficiency, too, scoring methods have been 
improved. Stencil scoring and machine scoring have become widely 
used. But it is possible that there is a genuine objection here. A 
test set up for stencil scoring or machine scoring must greatly 
restrict the responses of the subject. He must make a mark, or 
underline a word, or write in a word or other symbol in the right 
box, or perhaps punch a hole with a stylus. Items which can be 
thrown into shape for such responses are bound to be restricted, 
and the general effect is to make tests increasingly rigid and cut 
to pattern for convenience. 

Yet another factor of increased efficiency is the development of 
short tests, primarily for screening purposes. Of this the Wonder- 
lic Personnel Test and many of the new tests developed in the 
armed services are excellent instances, as are also the abbrevia- 
tions of the Wechsler-Bellevue scale by Gurvitz and Rabin. This 
trend is significant and widespread (v. Conrad), and may be said 
to reach a high point in the single-item tests described by H. M. 
Hildreth (q.v.). These single-item tests, which of course are still 
more brief and compact than the more usual adaptations of exist- 
ing instruments, were used at a naval training station for rapid 
mental screening, the purpose being not comprehensive measure- 
ment, but simply an assurance that subjects did not fall below a 
certain minimum level of ability. In one such test the problem was 
to give the products of 7 times 7, 8 times 8, 9 times 9, 10 times 10, 
11 times 11, and 12 times 12, and a passing rating was made on 
not more than two errors. Another consisted of the following two 
questions: “Why does the Moon look bigger than the stars? What 
time of day is your shadow shortest?” Full credit was given if 
both were correct. 
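
The scoring rule of the first of these single-item tests is simple 
enough to state mechanically. The following is a minimal sketch, 
reading the items as the squares of 7 through 12 and taking "not more 
than two errors" as the pass mark; the function and constant names are 
illustrative and do not come from Hildreth. 

```python
# The six products asked for: 7 x 7, 8 x 8, ... 12 x 12.
ITEMS = [(n, n) for n in range(7, 13)]
MAX_ERRORS = 2  # "a passing rating was made on not more than two errors"


def passes(answers):
    """Return True if the subject's six answers earn a passing rating.

    answers: the responses in the order 7 x 7 through 12 x 12.
    """
    if len(answers) != len(ITEMS):
        raise ValueError("expected one answer per item")
    errors = sum(1 for (a, b), given in zip(ITEMS, answers) if given != a * b)
    return errors <= MAX_ERRORS


print(passes([49, 64, 81, 100, 121, 144]))   # no errors
print(passes([49, 64, 81, 100, 120, 140]))   # two errors
print(passes([48, 64, 81, 100, 120, 140]))   # three errors
```

The rule yields only a pass or a fail, which is all that screening for 
a minimum level of ability requires. 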

There has also been an advance in general statistical adequacy. 
As to genuine psychological or even statistical improvements, 
the showing is not nearly so favorable. With the best tests, such 
as those discussed above, standardization is careful. As to its 
adequacy, it is usually based on quite large groups, though the 
selection and composition of such groups that have the vital func- 
tion of acting as true samples of mental ability may raise some 
questions. Perhaps purchasers and users of tests have become suf- 
ficiently sophisticated to be impressed by standardization groups 
running into the thousands. Highly dubious interpretations, such 
as the derivation of necessarily shaky grade norms, are on the 
whole avoided, though some peculiarities can be found, such as 
the derivation of the Otis intelligence quotient. However, the 
standardization pattern established more than twenty years ago 
is still being followed substantially, whatever may be its psycho- 
logical worth. 

A problem connected with standardization that has not been 
solved, and indeed hardly attacked in most test construction, turns 
on the relative difficulty of the items. If many of the items in 
subtests, or in a whole test organized on the spiral omnibus plan, 
are of about the same level of difficulty, the score means one thing. 
If they are in ascending order of difficulty, the score means some- 
thing very different. Quite insufficient attention has been paid to 
this point, even in the best commercial tests, in spite of the stand- 
ing monition contained in the work of Thorndike on the I.E.R. 
Intelligence Scale CAVD. 

Item selection throughout the period has on the whole been 
governed by essentially the same considerations—correlation with 
the test as a whole, power to discriminate between subjects 
“known” to be bright or dull on some external criterion which is 
often rather vague, and “opinions of experts.” The idea of select- 
ing items for the building of subtests which will correlate high 
with total scores and low with one another, which was at least 
recognized as an ideal in the Army work, has on the whole been 
disregarded except in principle; and, of course, the adoption of 
the spiral omnibus setup undercuts the whole issue. Item content 
impresses one by its extreme uniformity. The same items appear 
again and again with minor variations, relieved here and there by 
a few ingenious novelties of which the psychological significance, 
if any, is unknown. 

So far as this last statement is concerned, the one great qualifi- 
cation is the increasing application of the techniques of factor 
analysis. When it is proposed to construct a battery of tests whose 
component subtests are factorially pure, i.e., measure one and 
only one definable mental factor, a new and distinctive method 
of item selection at once appears. The application of factor theory 
is, indeed, the outstanding psychological development in this 
whole field, as contrasted with increases in efficiency and im- 
provements of already familiar techniques. Kornhauser (1945 a) 
reports a very decided trend of opinion among psychologists in 
favor of profile scores as contrasted with global scores such as 
percentiles, M.A.’s and I.Q.’s. It is, however, too early as yet to 
say with confidence how much this means in the way of a tangible 
improvement of tests. 

However, there cannot be the least doubt that the best tests, 
properly used and conservatively interpreted, are extremely serv- 
iceable instruments. On the basis of experience as well as of formal 
investigation, they have proved their value as practical tools of 
guidance. This, as Wechsler and others have pointed out, is the 
ultimate validation. The psychologists whose views were polled 
by Kornhauser (1945 a) were asked to rate intelligence tests as 
meeting practical needs (a) in the Army, (b) in schools, and (c) in 
business. Consolidating all responses which gave ratings of ex- 
tremely well, rather well, or better than without tests, the per- 
centages of votes were (a) 88%, (b) 67%, and (c) 67%. Durflinger 
(q.v.), in a limited but rather interesting study, finds that median 
correlations between intelligence scores and college marks have 
risen from .45 in 1934 to .52 in 1943, and suggests that this may 
indicate an improvement in testing, among other possibilities. Of 
course such figures are very far indeed from being decisive, and 
if there has been a general over-all improvement, it is probably 
quite slight. Indeed, it is possible to ask whether our present-day 
tests, in spite of their greater convenience and efficiency, actually 
predict more surely and significantly than those of twenty years 
ago, which is certainly a limiting factor on their practical valida- 
tion. 


2. Agreement and disagreement among tests 


A. A very considerable but by no means perfect agreement 
among the results of different verbal tests when applied to the 
same subjects has been reported in numerous studies. Thus Guiler 
(1921-22) reports correlations of .85 between the Stanford-Binet 
and the Illinois Intelligence Examination, of .75 between the 
Stanford-Binet and the National Intelligence Tests, and of .81 
between the National Intelligence Tests and the Illinois Examina- 
tion for a group of subjects in grades 6, 7, and 8. Then, some 
twenty-five years later we have such reports as that of Sartain 
(q.v.), dealing with the Revised Stanford-Binet, the Wechsler- 
Bellevue, and group intelligence tests, and of Traxler (1945), deal- 
ing with the Otis Self-Administering Test and the American 
Council on Education Psychological Examination, and giving 
closely comparable figures. These are representative results, and 
although a very great many more of about the same order, with 
only minor variations, have appeared, it hardly seems worth while 
to labor the point by reproducing them. What such results seem to 
indicate is that verbal intelligence tests have been rather con- 
sistently built about a similar concept translated into similar 
items. 

B. However, it must be remembered that such correlations rep- 
resent central tendencies or mean trends. Such mean relationships 
may be quite high, and still there may be much variability. This, 
in fact, is the case. Thus P. Cattell (1930) finds variations in 
individual performance as between different tests specially marked 
in upper extreme cases. As between the Otis Self-Administering 
Test and the Stanford-Binet, the mean difference at I.Q. 70 was 
only .4, but at I.Q. 130 it was 15.8. High correlations, it must be 
recalled again, mean stable relative standing, not equal scores. 
Gates (1923), again, found that I.Q.’s for the same pupil on six 
standard intelligence tests may range from 104 to 144, while mean 
class I.Q.’s may range from 109 to 129. Similarly, Miller (q.v.) 
gave ten intelligence tests to a group of 57 university freshmen, and 
obtained I.Q.’s ranging from 117.5 on the Stanford-Binet to 138.5 
on the Miller Mental Ability Test for the same individual. The 
clear conclusion is to beware of making absolute ratings on any 
single test, although the relative standings it reveals may be 
significant enough. 
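
The point that a high correlation guarantees stable relative standing 
but not equal scores can be shown with a small constructed case. The 
I.Q.’s below are invented for illustration and are not Cattell’s data; 
they are arranged so that two tests agree at the low end and diverge 
steadily as scores rise. 

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical I.Q.'s for the same five pupils on two tests.
test_a = [70, 85, 100, 115, 130]
test_b = [70, 89, 108, 127, 146]

r = pearson(test_a, test_b)
differences = [b - a for a, b in zip(test_a, test_b)]

print(f"correlation: {r:.2f}")
print("differences:", differences)
```

The rank order of the pupils is identical on the two tests, so the 
correlation is perfect, yet the brightest pupil’s two I.Q.’s differ by 
16 points. 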

C. Such differences in the rating of a given individual or group 
on different tests are accentuated and become a matter of constant 
expectation when comparisons are made between standings on 
verbal and performance tests. Thus Seagoe (q.v.) found that a 
test like the Pintner-Cunningham Primary is likely to yield I.Q.’s 
from 5 to 11 points higher consistently than such tests as the 
Terman Group Test of Mental Ability or the National Intelligence 
Tests. So, too, the Pintner-Paterson Scale of Performance Tests 
and the Arthur Scale of Performance Tests yield ratings consist- 
ently higher than those dependent on standard verbal tests. Corre- 
lations between performance type and verbal type tests are signifi- 
cantly lower than those between various good verbal intelligence 
tests. Thus it seems clear that a different conception of intelligence 
is involved, different perhaps not so much in the words in which 
it might be framed as in the items into which it is translated. 

D. There seems little doubt that the main reason for variations 
among tests is that their norms are based on different standardiza- 
tion groups. Various attempts have been made to overcome this. 
Thus Kefauver (q.v.) gave several tests to the same group and 
worked out standard scores based on its performance in them. 
Steckel (q.v.) restandardized the Kuhlmann-Anderson test, the 
Otis Group Test of Intelligence, Intermediate Examination, and 
the Otis Group Test of Intelligence, Advanced Examination, on 
10,799 children in the schools of Sioux City, Iowa, who constitute a 
reasonably homogeneous population, and derived percentile rank- 
ings of I.Q.’s which make inter-test comparisons possible. Cole 
(q.v.), again, has worked out and reported elaborate conversion 
tables which equate the scores made on the Terman Group Test 
of Intelligence, the Otis Group Test of Intelligence, Advanced 
Examination, and the Otis Self-Administering Test of Mental 
Ability. Obviously, one of the most crucial of all points of test 
construction is involved; namely, the use of a standardization 
group as a true and sufficient sample from which general conclu- 
sions can be drawn. This cannot help but make difficulties, and 
there is no way of avoiding them, for even the three comparable 
restandardizations just described are themselves based on some 
specific standardization group (see also Runnels). 
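
The general idea of putting several tests on a common footing by 
standard scores worked out on one and the same group can be sketched 
briefly. The raw scores below are invented, and the computation is 
only the ordinary standard-score formula; it is not offered as a 
reconstruction of the procedures of Kefauver, Steckel, or Cole. 

```python
from statistics import mean, pstdev


def standard_scores(raw):
    """Express each raw score as a deviation from the group mean,
    in units of the group's own standard deviation."""
    m, s = mean(raw), pstdev(raw)
    return [(x - m) / s for x in raw]


# Hypothetical raw scores of the same six pupils on two different tests.
test_a = [40, 50, 55, 60, 65, 90]
test_b = [12, 15, 17, 18, 20, 26]

z_a = standard_scores(test_a)
z_b = standard_scores(test_b)

for pupil, (a, b) in enumerate(zip(z_a, z_b), start=1):
    print(f"pupil {pupil}:  test A {a:+.2f}   test B {b:+.2f}")
```

The two sets of scores become directly comparable, but only relative 
to this particular group of six; a different group would give 
different standard scores, which is exactly the difficulty just noted. 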


3. Comparative evaluations of tests 


Some attempts have been made at relative appraisals of various 
tests—to decide which are better and which are worse—but with- 
out much broad success. 

A. There have been comparisons between various group tests 
and the Stanford Revision of the Binet scale, with varying and 
ambiguous outcomes so far as the relative merit of the tests com- 
pared is concerned. Thus Turney and Fee (q.v.) rank the Otis 
Self-Administering Test (Intermediate), the Terman Group Test, 
the National Intelligence Tests, and Haggerty Delta 2 in the 
stated descending order in terms of their agreement with Stan- 
ford-Binet I.Q.’s for the same group of subjects. The differences, 
however, are not great. Nor is it clear why the Stanford-Binet scale 
should be accepted as a norm of excellence. 

B. Another attempted criterion has been the intercorrelation 
of a test with a battery of other similar tests. Again, Turney and 
Fee report the Otis Self-Administering Test (Intermediate), Hag- 
gerty Delta 2, National Intelligence Tests, Terman Group Test, 
McCall Multi-Mental Scale in the stated descending order in 
terms of mean intercorrelations with the whole battery. But the 
coefficients are .761, .756, .755, .753, and .695, so that the true 
differences in relationship are trivial. Also, the point arises that 
if we had a genuinely and markedly superior test it would pre- 
sumably not correlate well with other and inferior instruments. 

C. Yet another criterion occasionally used is the distribution 
of scores yielded by a test. It is commonly held that a small or 
relatively limited spread of scores is preferable to a large one. 
As a universal proposition, however, this is open to considerable 
question. Kuhlmann (1930), as we have seen, prefers a wide dis- 
tribution of scores, and very justly remarks that there is no reason 
to suppose that a test which yields it is not revealing the true 
facts. 

D. Another criterion has been the relationship of the scores on 
a given test to some external standard, nearly always school or 
college achievement. Thus Jordan (q.v.) reports the correlations 
between certain test scores and high school marks as follows: 
Army Alpha, .38 to .41; Otis Group Test, average .66 with a range 
of .33 to .91; Terman Group Test, average .47 with a range of .30 
to .67. Guiler (1927) reports the Terman Group Test, the Otis 
Self-Administering Test, and the Ohio State Examination as cor- 
relating with college grades respectively .52, .49, and .47. It is 
noteworthy that correlations between intelligence scores and scores 
on such a broad instrument as the Stanford Achievement Test 
run higher than those between intelligence scores and marks, the 
obvious reason being that marks are rather low in reliability. In 
general, the wide spread of obtained correlations and the con- 
flicting mean correlations reported by various studies make test 
evaluation on this basis impossible. The nearest approach to some- 
thing significant here is the finding reported by Gates and La Salle 
(q.v.) that Stanford-Binet ratings maintain about the same rela- 
tionship to educational achievement over increasing intervals of 
time, whereas the relationship of ratings on all the other tests they 
studied drops with the passage of time. 

E. Kuhlmann (1928) has approached the question in a some- 
what different way. He compared seven standard intelligence tests 
with the Kuhlmann-Anderson Intelligence Tests in terms of their 
discriminative capacity. The assumption was that a good test 
should show marked advances with age and grade, and the less 
overlap the better. He was able to show that the Kuhlmann- 
Anderson test was markedly superior in these respects. This is 
perhaps the most impressive comparative study published. 

So one cannot make confident general statements about the
superiority or inferiority of the group tests we have been
considering. Of course, it is possible to criticize methods of
construction, standardization, etc., but this is a different matter from
direct comparative ratings. Also, the story is different if we have
special purposes in mind. Army Alpha and the Otis Quick-Scoring
Tests are too easy for college populations. The Kuhlmann Tests
of Mental Development and the Kuhlmann-Binet scale are better
as diagnostic instruments than the Stanford-Binet scale, and so
is the Wechsler-Bellevue scale. The latter has an outstanding
superiority for use with adults (v. Rapaport and others). A test,
in short, is a practical instrument, to be used with specific groups
with specific purposes in mind, not an instrument capable of
absolute measurement. And in general it would seem that all
carefully and competently constructed group tests are on about the
same level of excellence.


4. What group intelligence tests measure 


That group intelligence tests, and more particularly those of 
verbal type, are committed to a translation of general intelligence 
substantially into terms of academic ability, has long been fairly 
clear. The material they use is drawn largely from curricular 
sources, and performance is undoubtedly influenced by learning 
similar tasks in school. Thus Bishop (q.v.) set up 4 groups, 2 of
them from grades 7 and 8 and designated as A and B, 2 of them 
from grades 9 and 10 and designated C and D. These groups were 
equated in intelligence on the Otis Group Test. Lessons were 
devised to parallel the ten pages of the test, not containing the 
same material but similar in principle. Group A was taught the 
first 5 lessons, group B the second 5, group C all 10, and group D 
none. When the study period was over, they were retested. Group 
A gained 40% on the first half of the test and 6% on the second 
half. Group B gained 31% on the second half and 15% on the 
first. Group C made an over-all gain of 30%. Group D gained in 
all 11%. In other words, parallel though not identical teaching
greatly affected test performance. And there is no question but 
that the curriculum contains much material which at least in a 
general way parallels the content of group mental tests.
Moreover, one should notice the very strong tendency to use school
groups both for standardization and for validation. So the tests
are certainly closely related to school work and achievement.
However, as previously argued, this does not mean that they do
not reveal mentality but only special aptitude, unless one is
prepared to say that succeeding in school calls for a special and
limited talent.
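
The arithmetic behind the Bishop comparison can be laid out in a few lines. The following is only a sketch: the percentage gains are those quoted in the text, while the variable names and the idea of subtracting the untaught-half gain from the taught-half gain are illustrative choices of this example, not part of Bishop's report.

```python
# Percentage gains on retest, as quoted in the text for Bishop's four groups.
# Groups A and B were each coached on one half of the test only.
half_gains = {
    "A": {"taught_half": 40, "untaught_half": 6},
    "B": {"taught_half": 31, "untaught_half": 15},
}
# Group C received all ten lessons; group D (the control) received none.
overall_gains = {"C": 30, "D": 11}

# Illustrative index (an assumption of this sketch): how many percentage
# points the coached half gained beyond the uncoached half in the same group.
advantage = {
    group: g["taught_half"] - g["untaught_half"]
    for group, g in half_gains.items()
}

# Fully coached group compared with the uncoached control group.
coached_vs_control = overall_gains["C"] - overall_gains["D"]

print(advantage)            # {'A': 34, 'B': 16}
print(coached_vs_control)   # 19
```

In every comparison the coached material shows the larger gain, which is all the argument in the text requires. The differences are in percentage points and carry no significance test.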
SUGGESTED ADDITIONAL READINGS 


For additional reading and more intensive study of the material in
this chapter the most important sources are the tests and more
particularly the manuals of the tests discussed. Publishers will be found
listed in the bibliography of tests at the end of the book. Also the 
references mentioned in the text in connection with the various tests 
may be consulted. Further readings and suggestions are as follows: 

J. E. Anderson, “The limitations of infant and preschool tests in 
the measurement of intelligence,” Journal of psychology, 8 (1939), 
351-79. A critical review of such tests and a broad discussion of their 
limitations. 

Beth L. Wellman, The intelligence of preschool children as meas- 
ured by the Merrill-Palmer Scale of Performance Tests (University 
of Iowa Studies in Child Welfare, vol. 15, no. 3, 1938). To a con- 
siderable extent a “test of the test.” See pp. 144-48. 

T. W. Richards, “The relationship of psychological tests in the 
first grade to school progress; a follow-up study,” Psychological 
clinic, 21 (1932), 137-77; also “Psychological tests in the first 
grade,” Psychological clinic, 21 (1932), 235-42. The second article 
is a continuation of the first. A general survey of a number of first 
grade tests. 

David Wechsler, The measurement of intelligence (Baltimore: The 
Williams and Wilkins Company, 1944), Chapter 2, “The need for an 
adult intelligence test.” 

Oscar Krisen Buros (editor), The 1938 mental measurements
yearbook (New Brunswick, N. J.: Rutgers University Press, 1938); also
The 1940 mental measurements yearbook (Highland Park, N. J.:
The Mental Measurements Yearbook, 1941). These valuable
reference works should be consulted for test reviews and bibliographies.


QUESTIONS FOR DISCUSSION 


1. Examine the statistical data here presented on the values of
tests for early childhood, supplementing it if possible with data from
the references, and formulate the general conclusions that seem
indicated, citing specific material.

2. Compare the status and stability of Merrill-Palmer I.Q.’s with 
that of Stanford-Binet I.Q.’s and Wechsler I.Q.’s. 

3. Discuss the relative importance of the factors tending to lower
the stability and prognostic value of tests for early childhood
mentality.

4. Examine the item content of the tests here presented in synoptic 
outline. To what extent do the items seem indicative of intelligence?
What other factors might they indicate? 

5. Would a child’s reactions to a test situation, e.g., negativism, 
be themselves significant? If so, of what? 

6. Does it appear to you significant that as soon as tests go either 
above or below the school ages their efficiency seems to be lower?
Consider some reasons for such a phenomenon. 

7. Might one properly argue in the light of the data and discus- 
sions here that intelligence tests, even at the best, do not measure 
intelligence at all? If so, what might they really measure?

8. To what extent do Stoddard’s recommendations for an adult 
testing program seem to you adequate? Would you supplement them 
in any way? 

9. Might some such program be used for personnel work? For other 
purposes? Specify and discuss. 

10. What peculiar dangers might result from a wholesale adoption 
of profile ratings? Might this lead the whole testing movement into 
disaster? 




CHAPTER VII 


APTITUDE TESTING 
THE BASIC CONCEPT


An immense amount of work has been done in the huge and 
loosely bounded area of aptitude testing. This work has gained 
in general significance and to some extent in actual scope due to 
the increasing dissatisfaction with the global ratings obtained by 
intelligence tests. Here, as always, the basic problem is the search 
for and the practical definition of the conception of the aptitude 
to be measured. 

The term aptitude has been defined many times. Two such 
definitions are here cited. According to Bingham (q.v., p. 18),
“aptitude, then, is a condition symptomatic of a person's general 
fitness, of which one aspect is his readiness to acquire proficiency
—his general ability—and another is his readiness to develop an 
interest in exercising that ability.” According to Freeman (1939,
p. 182), “an aptitude is the ability or collection of abilities
required to perform a specified practical activity.” Freeman points
out that an aptitude is not to be thought of as necessarily innate, 
which is an important systematic caution. But on the other hand,
it is not a direct product of special training. Thus an aptitude for 
machine design is not the product of training in machine de- 
sign, but the ability, among other things, to profit by such 
training. 

Certainly we must avoid thinking of aptitudes as faculties, or 
unitary mental entities. Rather, they must be considered as dy- 
namic trends of the whole personality. 

All this is certainly vague enough, and the various
characterizations take in a great deal of territory. Indeed, the boundary lines
are anything but clear. Words such as aptitude, talent, special
ability, trait, and so forth, are constantly used in overlapping
senses, and the differences between them are hazy. It is the present
writer’s decided opinion that attempts at pedantic clarity of
definition here do very little good, and may easily do harm. Neat
classifications of existing instruments of measurement into
aptitude tests, talent tests, tests of special ability, and so forth, really
tell very little and only exercise the mind to small purpose.
Confusion, in all conscience, is quite great enough without adding to
it by a scholastic concern for the precise meaning of words. This
is why, in the treatment here presented, a great variety of tests
are lumped together under the single broad classification of
aptitude. The general idea is that there is something in the mental
organization that makes one good at clerical work, or mechanical
work, or science, or mathematics, or administrative or military
pursuits, or art, or music. Whether one wants to call this
something an aptitude, or a talent, or a trait, or a special ability, or
something else, is not of much moment. What is actually before
us is a wide diversified endeavor to reduce these “somethings” to
terms of practicable measurement and prognostication.

In general, there are two methods of doing this, though most 
workers with a job of test construction on their hands may not
confine themselves to one and exclude the other. (a) The job or 
practical function in which the aptitude expresses itself may be 
analyzed to single out its psychological components, perhaps by 
factor analysis, perhaps by simpler means, and then test items 
are organized to reveal these components. The extreme application 
of this procedure is found in the so-called work sample test, made 
up of actual samples of the activities involved in the job, or 
activities very closely analogous to them. (b) A much more general 
psychological analysis of the ability or aptitude in question may
be undertaken, and this again translated into test items. 

Good examples of both procedures will be found in this chapter. 
Instances of the former are the Minnesota Test for Clerical Work- 
ers and the Orleans-Solomon Latin Prognosis Test, which are 
made up of tasks closely similar to those in clerical occupations 
and the learning of Latin. An outstanding instance of the latter 
is the Seashore Measures of Musical Talent, which turn entirely
on a psychological conception of musical ability, and contain no
items from musical activity itself.


MEASURES OF MOTOR ABILITY


Numerous attempts have been made to single out motor ability,
which, be it noted, is a much more restricted function than
mechanical aptitude, as a definable and measurable aptitude.
Typical instances of the tests that have resulted are as follows.


1. Finger Dexterity Test *


The test is intended to measure moderately fine digital adjust- 
ment and control. It consists of a modified peg board. The material 
is a large metal plate with 100 holes, and also 300 metal pegs one
inch long, which fit three each into the holes. They are to be 
picked up three at a time and placed in the holes till all are filled. 
The score is the time taken. This test has had a fairly wide use.
Its reliability, validity, and general significance are none too clear. 
The norms worked out and reported by O’Connor are vague. Still, 
it has been found serviceable in testing small motor reactions. 


2. Tweezer Dexterity Test †


The task is to use a pair of tweezers to place small metal pins, 
one by one, in 100 holes in a metal plate. It is scored on time in 
seconds, i.e., the number of seconds between placing the first and 
last pins. It has a satisfactory reliability. Norms are worked out 
and reported. Thus on a standardization group of men, the score 
of 255 produces a standard score of 7.5, a percentile score of
99.4, and a letter grade of A. A score of 635, indicating much
slower performance, corresponds to a standard score of 2.5, a
percentile score of 0.6, and a letter grade of E-.
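
The pairing of standard scores with percentile scores quoted here is what one would expect if the standard scores lie on a normal scale with a mean of 5 and a standard deviation of 1. That scale is an assumption of the sketch below, inferred from the figures and not stated in the source, and the upper standard score is read as 7.5. The function name is hypothetical.

```python
from statistics import NormalDist

# Assumed scale (not stated in the source): standard scores are normally
# distributed with mean 5.0 and standard deviation 1.0.
ASSUMED_MEAN = 5.0
ASSUMED_SD = 1.0


def percentile_from_standard_score(score, mean=ASSUMED_MEAN, sd=ASSUMED_SD):
    """Return the percentage of the norm group falling below `score`."""
    return 100.0 * NormalDist(mean, sd).cdf(score)


for standard_score in (7.5, 5.0, 2.5):
    pct = percentile_from_standard_score(standard_score)
    print(standard_score, round(pct, 1))
# 7.5 99.4
# 5.0 50.0
# 2.5 0.6
```

Under that assumption the two percentile scores reported for the tweezer test, 99.4 and 0.6, are reproduced exactly, which suggests, without proving, that the published norms were derived by a normal-curve conversion of this kind.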


3. Minnesota Rate of Manipulation Test ‡


The material consists of a long board with 60 round holes in 
4 rows, 15 holes to a row, and the same number of cylindrical 
blocks of diameter one-sixteenth inch less than that of the holes. 
The placing test consists of putting the blocks into the holes with 
one hand. The turning test consists of taking them out with one 
hand, turning them over, and replacing them with the other hand. 
The score is on the time required. 

The test is useful for occupations and activities that require 
speed of gross movement, which it measures: package wrappers,
packet stuffers, assembly-line workers, possibly typists.


4. Stanford Motor Skills Test § 


This battery consists of 6 serial dexterity tests, selected from
among 20 tentatively considered, for the following reasons. (a)
Adaptation to school and factory use. (b) Economy of time, the
total time required being less than 2 hours. (c) Compactness and
practicability of the needed material. (d) Automatic scorability.
(e) High retest correlations. The retest coefficients reported run
from .75 to .86. (f) Low correlations with the Thorndike
Intelligence Examination for High School Graduates. (g) Low
correlations with special training in motor skills, e.g., in typewriting,
playing musical instruments, and athletics. (h) Low
intercorrelations among subtests, the mean being .25.

* References: O’Connor, 1928, 1938; Bingham; Green and Berman.

† References: O’Connor, 1928, 1938; Bingham; Green and Berman.

‡ References: Bingham; Green and Berman.

§ Reference: Robert Seashore, 1926, 1928.

The subtests of the battery are as follows. (1) Koerth Pursuit 
Test. The subject holds a metal stylus on a metal disk one-half 
inch in diameter mounted on a phonograph turntable set for one
revolution per second. Thus the disk follows a circular path. The 
score is the distance through which contact is maintained for 20 
consecutive seconds. There are 10 trials allowed. (2) Seashore 
Motor Rhythm Test (v. R. Seashore, 1926). This test requires 
the subject to tap out various rhythmic patterns dictated in taps, 
using a stylus with electric contact which records the result. It is 
scored on number of successful reproductions. (3) Tapping Speed. 
The subject presses and releases a telegraph key as fast as he 
can for a period of 5 seconds, a record being made of the result. 
Two trials are given. (4) Serial Discrimination. Four numbers are 
exposed (1, 2, 3, 4), and the subject responds by pressing the
appropriate key out of 4 before him. The numbers are exposed
visually in random order. The score is the number of correct
responses in 2 minutes. (5) Brown Spool-Packer Test. This con- 
sists of packing spools in a small box, using both hands. The score 
is the number packed in 3 minutes. (6) Miles Drill Test. Consists 
of rotating the handle of a small drill as fast as possible for 10 
seconds. Score is number of rotations. There are 3 trials. 

With all these and similar tests the great difficulty is the narrow 
range of their validity. They do not reveal or measure any
aptitude or factor of general motor efficiency. In all probability 


nN and again on a 
Or piles according 
taneously. He found virtually no intercorrelations and concluded
that there is no such thing as general motor ability. The attempt 
at rebuttal and defense made by Garfiel (q.v.) is not convincing,
for what she showed was that ratings by judges on the general 
motor ability of subjects show high correlations, which seems to 
demonstrate something about the rating rather than the ability 
itself. Buxton (q.v.), again, ran 9 motor tests with 76 boys, the 
tests including 2 for motor steadiness, 3 of tapping speed, packing 
spools, packing cubes, rotor mobility, and rotor pursuit. A 
factorial analysis revealed no general component of motor 
ability. 

Of course, it seems very reasonable, and also very inviting, to 
believe that human beings can be classified in terms of a general 
motor efficiency manifesting itself in all their doings. If this were
possible, it would be very useful. However, our existing psycho- 
metric instruments do not make it possible, and there is reason to 
doubt even its theoretical possibility. The tests we have seem to 
possess little significance beyond themselves and their immediate
and obvious applications. This is even true, in all probability, 
of the motor rhythm test, although “the importance of being 
rhythmic” is to many persons a golden thought. Jersild and 
Bienstock (q.v.), using a motion-picture technique, made very
accurate measurements of children’s clapping and stepping to 
music. Quite possibly this might be more revealing than the 
Seashore Motor Rhythm Test, which calls for the reproduction of 
rhythms in clicks, since the function seems more meaningful. But
the correlation between rhythmic efficiency and singing was only
.30. Motor tests, then, are serviceable for judging promise in
connection with activities, jobs, and functions with which they are
directly related. But they do not seem to reveal any such factor
as general motor efficiency. This may be because of the limitations
of the tests, or because no such factor exists.


TESTS OF MECHANICAL APTITUDE


a higher level of organization 
which call for the actual manipulation and assembling of
mechanical objects. (b) Paper-and-pencil tests, calling for a knowledge
of parts of machines, an understanding of mechanical
relationship, and so forth. A good many of the items used in all such tests
have some resemblance to those appearing in performance tests. 
Typical examples of both kinds are given below. 


1. Stenquist Assembly Test of General Mechanical Ability * 


The test consists of assembling two series each of 10 small 
objects such as may be purchased in the local stores. The objects 
are presented disassembled. The score is the number of articles 
correctly assembled in 30 minutes. These two series, which are
shown in Figure 24, are intended for subjects from the 5th grade
level to adult. A third series has been added intended for subjects
from the 3rd to the 5th grade.


SERIES I                  SERIES II
 1. Cupboard catch         1. Sash fastener
 2. Chain                  2. Rope coupling
 3. Mousetrap              3. Defiance paper clip
 4. Hunt paper clip        4. Expansion nut
 5. Bicycle bell           5. Double-action hinge
 6. Shutoff                6. Calipers
 7. Lock no. 1             7. Elbow catch
 8. Push button            8. Lock no. 2
 9. Clothespin             9. Expansion rubber stopper
10. Wire stopper          10. Pistol


FIG. 24. ARTICLES TO BE ASSEMBLED: STENQUIST ASSEMBLY TEST
OF GENERAL MECHANICAL ABILITY

The test 

mechanical aptitude by teachers of sho 


from Table 23. 


e tes a ly Unrelated to ener telli ence, as 
TABLE 23


CORRELATIONS BETWEEN SCORES ON STENQUIST ASSEMBLY TEST OF
GENERAL MECHANICAL ABILITY, SERIES I, AND
RATINGS BY TEACHERS


(Stenquist)

7th and 8th grade boys, Lincoln School ............... .83
8th grade boys, New York public schools .............. .80
8th grade boys, New York public schools .............. .42
6th and 7th grade boys, Horace Mann School ........... .81
6th grade boys, Horace Mann School ................... [illegible]
6th grade boys, Horace Mann School ................... .88


Stenquist test and the intelligence score on Army Alpha of a group of
909 members of the Army Engineer Corps.

Reliabilities as reported by Stenquist (odd-even) range from 
.80 to .06. Clearly there is considerable likelihood of the incidence
of variable error. Paterson and his co-workers extended the test 
by adding six more items, and obtained an odd-even reliability 
of .94 (Paterson and others, 1930).

Since the test shows low correlations with verbal intelligence, 
and high correlation with such criteria as teacher ratings, the 
conclusion is that it is a true aptitude test, and not a performance
test of intelligence. 


TABLE 24 


CORRELATIONS OF SCORES ON STENQUIST TEST OF GENERAL MECHANICAL 
ABILITY WITH INTELLIGENCE AS REVEALED BY SCORES ON ARMY ALPHA 


LOO MINSEIECHEd Ensen Bea oan Faie pain ET ia VA He -323 
107 American born, mostly inferior intelligence. ........ 350 
TOU Sclec fed! AdUES eb Be So eres St. HEME OO 
30 adults below score 50 on Alpha.........nseeccceee .0 
216 adults, low-grade intelligence .....ucccceeseeeeeee .0 
909 adults, Engineer Corps DIO 
30 feeble-minded cee * 32 


1007 children, 7th and 8th rl Sc SPER 397 
2. Minnesota Mechanical Assembly Test *


This is one of a battery of four tests developed under the 
direction of Paterson. The others are the Minnesota Spatial Rela- 
tions Test, the Minnesota Paper Form Board, and the Minnesota 
Interest Analysis Test. Of these three, the first and second are 
discussed below. 

The Mechanical Assembly Test here under consideration is an 
extension and elaboration of the Stenquist Assembly Test of 
General Mechanical Ability. It consists of 33 disassembled objects 
similar to those used by Stenquist, which come in 3 boxes, A, B, 
and C, the parts of the objects being in different compartments. 
The complete assembling of each object involves a stated number 
of “connections,” i.e., the bringing together of any two parts of 
the object. A fixed time is allotted to each object, and scoring 
depends on how much is done within the time limit. If the object
is completely assembled, a score of 10 is earned. If it is partially 
assembled, the score is computed on the number of connections 
actually made as related to the total number required for the job. 
Thus the bottle stopper in box A requires 3 connections, and the 
scores may be 0, 3, 6, 10. The spark plug in box B requires 5
connections, and the scores may be 0, 2, 4, 6, 8, 10.
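
The two examples of partial credit just given are consistent with a simple rule: ten points times the fraction of connections made, with any remainder dropped. The sketch below assumes that rule. It is inferred from the bottle-stopper and spark-plug examples only, the function name is hypothetical, and the published scoring tables for other objects may differ.

```python
def partial_credit(connections_made, connections_required):
    """Score one object on a 0-10 scale from the connections completed.

    Assumed rule (inferred from the two examples in the text, not quoted
    from the test manual): 10 * made / required, rounded down.
    """
    if connections_required <= 0:
        raise ValueError("an object must require at least one connection")
    if not 0 <= connections_made <= connections_required:
        raise ValueError("connections made must lie between 0 and the total")
    return (10 * connections_made) // connections_required


# Bottle stopper, box A: 3 connections required.
print([partial_credit(k, 3) for k in range(4)])   # [0, 3, 6, 10]
# Spark plug, box B: 5 connections required.
print([partial_credit(k, 5) for k in range(6)])   # [0, 2, 4, 6, 8, 10]
```

Rounding to the nearest whole number would give 7 where the text gives 6 for two of three connections, which is why the remainder is dropped here.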

This test, like the other three in the battery, was developed as 
a result of long and elaborate research, the reports of which con- 
stitute one of the fullest analyses of mechanical aptitude available. 
A large number of tests and test items were tried out, and a
threefold validation criterion was developed. This consists of (a) a
measure of the quality of mechanical work done, i.e., the quality
criterion; (b) a measure of the quantity of mechanical work done,
in relation to quality, i.e., the quantity-quality criterion; (c) a
measure of information about [illegible], i.e., the information criterion.

Persons in mechanical occupations tend to show a superiority
on this test. It has very little relationship to general verbal
intelligence or to motor agility. Thus it was found that 70% of
mechanics did better than the average clerk, but only 11% of the
mechanics equalled the verbal intelligence score of the average
clerk.

* References: Paterson, Elliott, Anderson, Toops, and Heidbreder; Hunt;


3. Minnesota Spatial Relations Test *


The test consists of 4 form boards, A, B, C, D, each with 58 
pieces to be placed. One set of blocks is used for boards A and B,
and a second set for boards C and D. The scoring is on both time 
and errors, an error meaning an attempt to place a block in a 
wrong hole. A reliability of .84 and a validity against the
threefold criterion mentioned above of .53 is reported. The test has a
low relationship to general verbal intelligence and to agility. Its 
validity for specific vocational forecasts is unknown. It evidently 
measures a component of mechanical ability, for 102 automobile 
mechanics were found to make a better mean score than 82% of 
an unselected population. 


4. Minnesota Paper Form Board †


A sample item from this test appears in Figure 12. The ma- 
terial consists of sets of geometrical figures similar to the set there 
shown. On the left side of the sheet is a large figure, and on the
right are smaller figures. The task is to draw lines in the larger 
figure to show how the smaller ones can be fitted into it. There 
are 2 series, A and B. Timing is 15 minutes for each series. The 
score for each is the number of right solutions. A reliability of
.90 and a validity of .52 against the criterion used in all these
tests and described above has been reported. 


5. I.E.R. Assembly Test for Girls ‡


This is an assembly test similar to the Stenquist and the Min- 
nesota, but using material more suitable for girls. There are 11 
subtests as follows. (1) Stringing beads. (2) Inserting tape. (3)
Making rosette. (4) Cross-stitching. (5) Assembling key ring. 
(6) Assembling clips. (7) Tape sewing. (8) Attaching trunk tag. 
(9) Wrapping string around cards. (10) Assembling booklet. 
(11) Cutting and trimming paper. In the short form worked out 
by Metcalfe and Burr, subtests 1, 5, 8, and 10 have been elimi- 
nated because of various inconveniences and inadequacies, leaving 
7 subtests. The time limits are 45 minutes for the complete form
and 25 minutes for the short form. Each form is scored on
adequacy of response to the subtests. The short form correlates 
with the long form .93 for 1,300 cases. 

* References as above. 

† References as above.

‡ References: Burr and Metcalfe, 1936, 1937; E. B. Greene; Toops, 1923.

The test has proved useful with factory jobs requiring routine
piecework of the general type included. Some vocational norms 
have been worked out. A score of 50% on subtests 2 and 3 is said 
to indicate fitness for simple assembly jobs; 50% on subtests 6 
and 9 is said to indicate fitness for harder assembly jobs; 50%
on subtests 4 and 7 is said to indicate fitness for sewing (Metcalfe
and Burr). 


6. Hand Tool Dexterity Test (Bennett) 


The test consists of a wooden frame with two uprights, in one 
of which are 12 bolts in 3 rows of 4 each, and in the other 12 
corresponding holes. Three wrenches and a screwdriver are pro- 
vided. The task is to disassemble the bolts from one upright and 
reassemble them in the other. The score is the time required. The 
test measures proficiency with ordinary tools, which the author 
regards as a combination of aptitude and achievement based on 
experience. Brief practice does not seriously affect the scores. A 
retest reliability of .91 is reported. The test has been found to 
correlate .46 with foremen’s ratings. Percentile norms are given 
for factory workers and for adults in a vocational guidance center. 


7. Stenquist Mechanical Aptitude Test * 


Unlike those so far discussed, this is a paper-and-pencil test. 
It is intended for use for grades 9 to 12. It consists of two parts. 
The first requires the matching of mechanical objects, e.g., a 
wrench to go with a spark plug as the proper tool to use. The
second calls for knowledge of the parts of machines and of
mechanical objects. 


8. O'Rourke Mechanical Aptitude Test †


This is another test of the same general type, i.e., calling for
an actual manipulation or 
art I is pictorial. It is made 
£ to indicate which pictured 
) Should be used on the pic- 
so, there are items in which 


* Reference: Stenquist, 
† References: Bingham; Fryer (1937).

form. The basic assumption here is that continuing mechanical
interest will result in gaining and retaining mechanical infor- 
mation. 

The test was standardized on 9,000 men of ages from 15 to 24. 
The percentile and standard score norms which are shown in
Table 25 were worked out on about 70,000 cases. It is said to
correlate from .64 to .84 with ratings by teachers of shopwork,
but the figures seem extremely high, and the validation process
is not well explained. The test was widely used by the Tennessee
Valley Authority. 

As is shown in the data here cited, and in the large amount of
additional material to be found in the reference sources, tests
such as these, of which many others exist, can be adequately
reliable. They are unrelated to general intelligence and to motor
agility. So far, then, mechanical aptitude seems to stand up as a
well delimited concept. But factor analysis reveals that it is
certainly not a unitary component of human mentality or
behavior (v. Paterson, Elliott, Anderson, Toops, and Heidbreder).
We have, therefore, another practical working concept,
which serves fairly well. The tests built around it, and [illegible]
and assembly tests, can serve useful purposes for vocational
and educational guidance where immediate choices [illegible]
are the issue. For long-time predictions, however, they are
open to the gravest doubt. Thus Thorndike (v. Thorndike, Bregman,
Metcalfe) made a follow-up study of 2,225 children who were
tested in the 8th grade for intelligence, clerical aptitudes, and
mechanical adroitness on a battery of 14 tests. They were followed
up for about 8 years, and the vocational success of those leaving
school was studied. No success at all in predicting success in
mechanical occupations was reported, although a large range of
such occupations was included. The value of paper-and-pencil tests
of mechanical aptitude is decidedly more doubtful than that of the
assembly tests, in spite of the impressive validation coefficients
reported by O’Rourke, which are open to considerable question.


TABLE 25


RAW SCORES ON O'ROURKE MECHANICAL APTITUDE TEST WITH
CORRESPONDING STANDARD SCORES AND CENTILE SCORES


(From Bingham, Table 34, p. 319)


Raw Scores    Standard Scores    Centile Scores

   317             7.0               97.7
   295             6.5               93.3
   265             6.0               84.1
   233             5.5               69.1
   198             5.0               50.0
   172             4.5               30.9
   115             3.5                6.7


9. Test of Mechanical Comprehension * 


The test consists of 60 pictures of mechanical situations, in each 
of which a problem exists, e.g., which of 2 pairs of shears would 
cut metal better; which of 2 cords is an ordinary electric light 
cord; which of 2 rooms would have more echo; in which direction
is the last of a set of gears turning. There is no time limit, but 
the test usually takes 20 to 25 minutes. There are two forms. Form 
BB is suitable for male candidates for engineering schools, engi- 
neering students, and comparable adult men. Form AA is for males 
in high school or trade school. Form BB is about 12 points more 
difficult. The test is not well suited for women. The mean score 
of women in educationally comparable groups is about 14 points 
lower than that of men. Answer sheets are provided for hand or 
machine scoring. Split-half reliabilities of .80 and .84 are given,
with corresponding standard errors of 4.3 and 4.5. Test-retest 
reliabilities given are .90 to .93, with standard error of 3.0. The
test correlates from .30 to .60 with success in engineering-type
occupations. Percentile norms are provided. 
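
The reliabilities and standard errors quoted for this test can be checked against one another. The sketch below assumes that each "standard error" is a standard error of measurement in the usual sense, SEM = SD × √(1 − r). The source does not state the formula, so this reading is an assumption, and the function name is hypothetical. On that reading, each reported pair implies a standard deviation for the norm group.

```python
import math


def implied_sd(standard_error, reliability):
    """Back out the score SD from SEM = SD * sqrt(1 - r).

    Assumes the reported standard error is a standard error of
    measurement; the source does not say how it was computed.
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return standard_error / math.sqrt(1.0 - reliability)


# (kind of coefficient, reported standard error, reported reliability)
reported = [
    ("split-half", 4.3, 0.80),
    ("split-half", 4.5, 0.84),
    ("test-retest", 3.0, 0.90),
    ("test-retest", 3.0, 0.93),
]

for kind, sem, r in reported:
    print(f"{kind:11s} r={r:.2f} SEM={sem:.1f} implied SD={implied_sd(sem, r):.1f}")
```

The implied standard deviations fall between roughly 9.5 and 11.3 score points, so the several figures hang together reasonably well under the assumed formula.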


TESTS FOR VOCATIONAL APTITUDES


Very large numbers of tests of various types exist for the 
measurement of aptitude in numerous vocations. These tests are
of varied type. Some envisage classes or groups of occupations,
such as clerical, mechanical, and the like. Others are intended to
uncover fitness for some specific vocation or job. Since these are
of minor psychometric and psychological interest, none of them
is treated here. The boundary lines, however, are indistinct, as
they always are in the whole field of testing. The two tests for
clerical workers here discussed might be considered as belonging
to either of the two categories above. Both of them, and more
particularly the Minnesota test, are included to give the reader a
concrete example of a measuring instrument based primarily upon
job analysis in contrast to broader psychological considerations.

* References: Bennett and Cruickshank, 1942, a and b; Bennett and Gear.


1. Detroit General Aptitudes Examinations * 


The instrument consists of 16 subtests, each with a timing of 
3 to 5 minutes. (1) Rate and quality of handwriting. (2) General 
information. (3) Arithmetic. (4) Motor speed, shown by tracing 
circles. (5) Knowledge of the names of tools shown in pictures. 
(6) Disarranged pictures. (7) Verbal opposites. (8) Spelling 
errors, i.e., misspellings to be indicated. (9) Size discrimination. 
(10) Verbal analogies. (11) Checking test, for speed and accuracy, 
pairs of names and numbers to be indicated same—different. 
(12) Tool information. (13) Classification. (14) Tracing mechan- 
ical relationships shown in belt-and-pulley drawings. (15) Dis- 
arranged sentences. (16) Alphabetization. 

In the scoring, each subtest is given 30 to 53 points. Scores may 
be on three bases: (a) intelligence, i.e., subtests 2, 3, 4, 6, 7, 8, 10, 
13, 14, 15; (b) clerical aptitude, i.e., subtests 1, 3, 4, 6, 8, 11, 13, 
15, 16; (c) mechanical aptitude, i.e., subtests 1, 3, 4, 6, 9, 12, 13, 
14. Thus there are 5 subtests in common as between the intelli- 
gence and mechanical aptitude scores, 6 in common as between 
the intelligence and clerical aptitude scores, and 5 in common as 
between the clerical aptitude and mechanical aptitude scores. 
Correlations between sets of scores are reported as being .80, .70, 
and .73. Correlations with independent measures of intelligence 
are as follows: With the Detroit Advanced Intelligence Test, .90 
for 188 12th-grade children; with Stanford-Binet (1917) I.Q.s, 
.652 for 188 12th-grade children; clerical score with Detroit 
Advanced Intelligence Test, .739 for 188 12th-grade children. There 
is no demonstration that this is a true aptitude test, or that it has 
special validity for clerical or mechanical occupations. Its relia- 
bilities for the three scores are .80, .90, .88 (retest). Age norms are 
reported on about 10,000 cases. The general conclusion seems to 
be that this test, in spite of its name, functions as a general intelli- 
gence test. 
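
The overlap counts given above can be checked mechanically. The following Python sketch treats each score as a set of subtest numbers, taken from the lists in the text (with the print-damaged numerals read as 11 and 13, which is an assumption), and counts the subtests shared by each pair of scores. The names `SCORES` and `shared_subtests` are illustrative only.

```python
from itertools import combinations

# Subtests entering each score, as listed in the text
SCORES = {
    "intelligence": {2, 3, 4, 6, 7, 8, 10, 13, 14, 15},
    "clerical": {1, 3, 4, 6, 8, 11, 13, 15, 16},
    "mechanical": {1, 3, 4, 6, 9, 12, 13, 14},
}


def shared_subtests(scores):
    """Return the sorted list of common subtests for every pair of scores."""
    return {
        (a, b): sorted(scores[a] & scores[b])
        for a, b in combinations(scores, 2)
    }


for (a, b), common in shared_subtests(SCORES).items():
    print(f"{a} / {b}: {len(common)} in common {common}")
```

With so many subtests shared, substantial correlations among the three scores are to be expected whatever the scores measure, which bears on the conclusion that the instrument functions as a general intelligence test.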


* Reference: Lorge, 1941 b. 


2. Detroit Clerical Aptitudes Examination * 


This test, which comes in a separate form, consists of subtests 
1, 3, 4, 6, 11, 13, and 16 of the above, and in addition a subtest on 
commercial vocabulary. 


3. Detroit Mechanical Aptitudes Examination † 


The test consists of subtests 3, 4, 5, 6, 9, 12, 13, 14 from the 
General Aptitudes Examination, and no others. 

As to both the last-mentioned tests, the clerical score or the 
separate clerical test may give some basis of understanding not 
given in the intelligence items, although just what this may be is 
not clear; the mechanical score and separate test are based, how- 
ever, on an entirely undefined concept and there is no clear 
criterion. 


4. Minnesota Vocational Test for Clerical Workers ‡ 


This is an excellent example of a test oriented by job analysis, 
and of the general work-sample type. The content consists of pairs 
of numbers and pairs of names, which are to be checked as same 
or different. For example: 


     5794367                    5794267 
     79542                      79542 
     John C. Linder             John C. Lender 
     Investors’ Syndicate       Investors’ Subdicate 


Parts 1 and 3 consist of number items. Parts 2 and 4 consist of 
name items. 
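
The checking task just described is simple enough to sketch in Python. The item pairs are the four examples printed above; the scoring rule (a plain count of correct same/different judgments) is an assumption made for illustration, since the text does not state how the test is actually scored, and the function names are hypothetical.

```python
# Example item pairs as printed in the text
PAIRS = [
    ("5794367", "5794267"),
    ("79542", "79542"),
    ("John C. Linder", "John C. Lender"),
    ("Investors' Syndicate", "Investors' Subdicate"),
]


def answer_key(pairs):
    """Mark each pair 'same' only if the two members match exactly."""
    return ["same" if left == right else "different" for left, right in pairs]


def count_correct(responses, pairs):
    """Assumed scoring rule: one point per judgment that agrees with the key."""
    key = answer_key(pairs)
    return sum(1 for given, correct in zip(responses, key) if given == correct)


print(answer_key(PAIRS))
print(count_correct(["different", "same", "same", "different"], PAIRS))
```

The candidate in the last line overlooks the Linder/Lender discrepancy, which is the kind of slip such items are built to catch.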

Reported reliabilities are .75 for number checking and .93 for 
name checking. As to validation, male clerical workers do much 
better than an unselected population, and one must exceed 95% 
of the general population to do as well as the average clerk. Cor- 
relations from .60 to .70 have been reported with evaluated per- 
sonal histories. The test is found to be diagnostic of filing ability. 
It is relatively independent of general intelligence, and thus a true 
aptitude test. In its construction many experimental tests were 
tried out on clerical and general workers and on employed and 
unemployed clerical workers, and it was found to be the most 
sensitive instrument of differentiation discovered. 


* Reference: Lorge, 1941 a. 
† Reference: Lorge, 1941 c. 
‡ References: Andrew and Paterson; Andrew; Green and Berman; Bingham. 


TESTS OF PROFESSIONAL AND ACADEMIC APTITUDE 


Here we encounter further attempts to work out and define 
effective and well-founded differentiating concepts, and to trans- 
late them into instruments of measurement. Two points should 
be emphasized in advance. (a) In all professional aptitude tests, 
the criterion on which by far most reliance is placed is not pro- 
fessional success, but success in professional studies. (b) Item 
content is strikingly similar to that found in general intelligence 
tests, combined with a certain degree of special reference or bias. 
It probably cannot be shown, and certainly has not been, that ob- 
tained scores are independent of or not highly correlated with 
intelligence test scores. So the probability is that they are aptitude 
tests only in name, and in reality intelligence tests oriented 
towards some special group or interest. 


1. Medical Aptitude Test * 


This test is a good example of the point that has just been made. 
It was administered annually from Washington under the super- 
vision of the Committee on Aptitude Tests for Medical Students 
of the American Association of Medical Colleges. Frequent revi- 
sions were made, and each year data on the test, from some 600 
institutions using it, were collected, tabulated, and reported back 
with interpretive comments. 

Form 16 of the test (Moss, 1942) consists of 7 subtests as fol- 
lows. (1) Visual memory. (2) Memory for content. (3) Memory 
for content. (4) Scientific vocabulary. (5) Understanding of 
printed material. (6) Scientific definitions. (7) Logical reasoning. 

The revisions which have been made from time to time have been 
based largely on studies of the predictive power of the subtests. 
Typical validation material has been presented in Table 3. This 
shows the relationship of test scores to success in medical school. 
Beyond this a less determinate relationship was established be- 
tween the scores and success during internship. Kandel (q.v.), 
considering its item content, is no doubt quite correct in consider- 
ing it in effect a general alertness test with a strong medical slant. 
Thus it does not deal with aptitude in the strict sense, and in this 
respect it is typical of many similar instruments. It represents an 
interesting and apparently in the main successful undertaking, and 
the fact that for various reasons it has recently been discontinued 
does not make it less instructive. 


* References: Kandel; Moss, all entries. 


2. Law Aptitude Examination * 


This is another special purpose test similar in general to the 
foregoing. For various reasons, most of them practical and ad- 
ministrative and having nothing to do with the intrinsic excellence 
of the instrument, it has not been as widely used as the Medical 
Aptitude Test. It has not gone through the same sequential re- 
visions, nor has extensive experience with its use accumulated and 
been subjected to tabulation and analysis to the same extent. 

The examination consists of 5 subtests as follows: (1) Accuracy 
of recall. (2) Reading comprehension using legal material. (3) 
Reasoning by analogy. (4) Reasoning by analysis. (5) Skill in 
pure logic. Item content has a bias towards legal material. Once 
more, then, it seems to figure as a special purpose intelligence 
test rather than what might be defined as an aptitude test proper. 

This, too, is the character of the law aptitude test constructed 
by W. M. Adams (q.v.). It consists of 8 subtests, namely difficult 
analogies, mixed relations (giving the 2 most closely related words 
in groups of 6), opposites (selecting from 5 choices the word 
opposite in meaning to the key word), memory (using readings of 
judicial opinions), relevancy (using an involved legal case), read- 
ing comprehension (using judicial opinions), and legal informa- 
tion. Adams found that a test of this type has closer relationship 
to law school achievement than any of the usual predictive factors, 
all of which he investigated. 


3. Iowa Placement Examinations † 


These tests, which were developed and used in connection with 
the admissions and freshman guidance programs at the University 
of Iowa, deal with specific subject matter areas, including Eng- 
lish, mathematics, chemistry, French, and physics. The tests for 
each such area include two series; namely, a training series and 
an aptitude series. The aptitude series may properly be considered 
a psychometric instrument, whereas the training series is in effect 
an achievement test. The aptitude series is closest to the ordinary 
intelligence test, but contains acquired material from the subject 
in question. The battery yields a profile rather than a global score. 
This is in keeping with Stoddard’s contention that global over-all 
scores are misleading in dealing with adults. In his account he 
refers to this battery as a “new departure in mental measurement.” 


* References: Ferson and Stoddard; Kandel. 
† References: Stoddard, 1926, 1943. 

To give an idea of the tests in the aptitude series, the Mathe- 
matics Aptitude Test, which has already been characterized here, 
includes subtests for ability to complete arithmetical and algebraic 
series, to solve originals requiring the use of spatial imagination, 
to use symbolic logic, and to interpret difficult mathematical 
reading. The Foreign Language Aptitude Test consists of subtests 
to reveal knowledge of English—parts of speech, inflexions and 
roots, transfer of training from English to Esperanto, skill in 
grammatical principles, reading comprehension, and translation 
from English to Esperanto. 

The general effectiveness and validity of the instrument may 
be judged from the data presented in Tables 26 and 27. With 
regard to the correlations in Table 26, it should be noted that they 
are between test performance and marks in the subject indicated, 
not average marks. They are approximately [...] within the range 
of many correlations between intelligence test scores and averages 
of measures of college achievement, and better than most 
correlations between scores and special subject achievement. 
Stoddard very truly points out that such coefficients are difficult 
for educational authorities to interpret, and considers that they 
fail to tell the story of the relationship as clearly as might be 
wished. He considers that the real situation is better and more 
clearly revealed in the data presented in Table 27. It is quite 
evident that decile ratings on the Placement Examination are 
indicative and useful, and that the quartiles are very sharply dis- 
tinct. The indication is that chances for success at the university 
are forty times as great for some candidates as for others. 


TABLE 26 


CORRELATIONS OF SUBJECT-MATTER MARKS, FIRST SEMESTER FRESHMAN 
YEAR, WITH IOWA PLACEMENT EXAMINATION RATINGS 


(Stoddard, 1928, Table 1, p. 96) 


                  Series I (Aptitude)        Series II (Training) 
Subject           (Testing time 40 min.)     (Testing time 80 min.) 

Chemistry                 .50                        .60 
English                   .55                        .60 
French                    .60                        .65 
Mathematics               .55                        .60 
Physics                   .50                        .55 

TABLE 27 


PERCENTAGE OF STUDENTS IN DECILES AND QUARTILES OF IOWA 
PLACEMENT EXAMINATIONS MAKING GRADES OF A OR B, C OR D, 
AND F IN CHEMISTRY IN FIRST SEMESTER FRESHMAN YEAR 


(Stoddard, 1928, Table 2, p. 97) 


                          CHEMISTRY APTITUDE    |   CHEMISTRY TRAINING 
                                SERIES          |         SERIES 
TEST SCORES                     Grades          |         Grades 
                          A, B    C, D     F    |   A, B    C, D     F 
Deciles 
  1 (highest)              68      31       1   |    70      30       0 
  2                        44      51       5   |   5[?]     47       2 
  3                        34      57       9   |    42      54       4 
  4                        27      64       9   |    30      63       7 
  5                        26      60      14   |    30      60      [?] 
  6                        18      67      15   |    22      52      19 
  7                       [?]      63     [?]   |   [?]      58      12 
  8                        12      63      25   |    26      66       8 
  9                         5      65      30   |    19      61      [?] 
  10 (lowest)             [?]      51      46   |     8      64      [?] 

upper quartile             51      45       4   |    58      41      [?] 
upper middle quartile      28      61      22   |   [?]     [?]      [?] 
lower middle quartile      18      64      18   |    27     6[?]     [?] 
lower quartile              6      60      34   |    16      61      23 


Since some rather forceful claims have been made on behalf of 
this instrument as a noteworthy psychometric advance, it is of 
some interest to compare its predictive efficiency with that of more 
frequently used measures. At the University of Minnesota it was 
found that the best predictive index for freshman achievement 
consisted of a weighted score combining standing on the American 
Council on Education Psychological Examination and percentile 
rank in high school. The relationship of this weighted score to 
college achievement is shown in Table 28. It is not possible to 
make a close comparison, because the data are reported in a dif- 
ferent form, but at any rate the decile predictive values here and 
at Iowa do not seem of an altogether different order. However, in 
the Iowa report the relationship is with special subject marks, 
whereas in the Minnesota report it is with average marks, which 
would be expected to raise it substantially. 


TABLE 28 


WEIGHTED PERCENTILE RANKS IN ADMISSIONS CRITERION AND 
PROBABILITY OF MARKS OF C OR BETTER IN ARTS COLLEGE, 
UNIVERSITY OF MINNESOTA 


(v. University of Minnesota a.) 


                                        N Making          Percent Making 
Weighted Percentile Rank       N        Average Grade     Average Grade 
                                        of C or Higher    of C or Higher 

(highest)                     148           141               95.3 
                              239           175               73.2 
                              254           170               66.9 
                              223           108               48.4 
                              184            59               32.2 
                              151            36               23.8 
                              104            24               23.0 
                               48            12               25.0 
                                8             2               25.0 
(lowest)                        1             0                0.0 

Total                        1360           727               53.5 
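
The arithmetic of Table 28 can be retraced with a few lines of Python. The counts below are read from the table, several of them through print damage (141, 151, 12, 2, and 0 are the readings assumed here); the check that the columns add to the printed totals of 1360 and 727 supports those readings. The function name is illustrative.

```python
# (N in rank interval, N making an average grade of C or higher), top to bottom
ROWS = [
    (148, 141), (239, 175), (254, 170), (223, 108), (184, 59),
    (151, 36), (104, 24), (48, 12), (8, 2), (1, 0),
]


def percent_c_or_higher(n, n_passing):
    """Percentage of the interval making C or higher, to one decimal place."""
    return round(100.0 * n_passing / n, 1)


total_n = sum(n for n, _ in ROWS)
total_passing = sum(passing for _, passing in ROWS)

for n, passing in ROWS:
    print(n, passing, percent_c_or_higher(n, passing))

print(total_n, total_passing, percent_c_or_higher(total_n, total_passing))
```

Recomputed in this way, two rows (184 and 104 cases) come out a tenth of a point away from the printed percentages, which may reflect rounding in the original report.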


Turning to more comprehensive studies of the problem such as 
those of MacPhail (q.v.) and Remmers (1934 a), which summarize 
data accumulated over a considerable period of time and from 
many institutions, the central tendency of correlations between 
intelligence scores and academic success is said to be between .40 
and .45 with a range from .13 to somewhat in excess of .70, and 
about two-thirds of the coefficients lying between .30 and .60. 
However, as already pointed out, these are correlations with 
average measures of college achievement, and the relationship of 
intelligence scores to special subject success is more variable and 
tends to be lower. Thus it seems a reasonable conclusion that 
instruments like the Iowa Placement tests or the Medical Aptitude 
test, which might perhaps better be called special purpose intelli- 
gence tests than aptitude tests in the narrower and more specific 
sense, have a higher prognostic validity for that purpose than 
general intelligence tests. The problem calls for further and more 
precise analysis, but so far as it is possible to estimate the evi- 
dence, Stoddard’s claim to a distinctive advance in psychometric 
procedures would seem justified. 


4. Metropolitan Readiness Tests * 


This is a battery intended for use with young children in order 
to test readiness for reading instruction. It contains 7 subtests. 
(1) Similarities and opposites; (2) calls for the copying of forms; 
(3) and (4) test vocabulary and understanding of sentences by 
the use of pictures; (5) is a 40-item test including number vocabu- 
lary, counting, ordinal numbers, recognition of written numbers, 
writing numbers, interpretation of number symbols, meaning of 
numerical terms, meaning of fractional parts, recognition of forms, 
telling time, and use of numbers in simple problems; (6) is an 
information test; (7) calls for freehand drawing. Those who make 
a score of less than 60 are considered unready to read. The relia- 
bility is probably poor, which is virtually inevitable in the case of 
a group test for young children. The scores must be interpreted 
with much caution. The instrument is not unlike the Pintner- 
Cunningham Primary Test, which is now incorporated as the first 
level of the Pintner General Ability Tests. It is to be considered 
as a general intelligence test emphasizing “school readiness.” 


5. Algebra Prognosis Test † 


This is an economical and ingenious test based upon what is 
in effect a job analysis of algebra, and tending towards the work- 
sample type. It consists of a series of short and simple lessons 
dealing with topics and problems like those encountered in a first 
course in algebra, such as the use of letters for numbers, the 
omission of the “times” sign between letters and numbers, etc. 
Thus the task is not merely to perform, but rather to learn, pre- 
sumably in much the same way as the subject will have to learn 
in the algebra course itself. A correlation of .71 with achievement 
in algebra after one term is reported, which of course is rather 
impressively high. 


* Reference: Grant. 
† Reference: Symonds, 1927. 


6. Latin Prognosis Test ‡ 


This is a companion test to the former. It consists of a series 
of very simple short lessons in Latin, e.g., on derivatives, singular 
and plural forms, gender, case. A test is placed at the end of each 
short lesson, to see how well the subject has learned it. A correla- 
tion of .80 with Latin achievement is reported, which is about as 
high as it can possibly be, considering that all the measures con- 
cerned are to some extent unreliable. 
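
The remark that unreliability sets a ceiling on the obtainable correlation can be illustrated with the classical attenuation relation, under which the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. The Python sketch below applies that relation; the reliabilities used are hypothetical, since none are reported here for the Latin Prognosis Test or its criterion, and the function name is illustrative.

```python
import math


def max_observable_correlation(reliability_test, reliability_criterion):
    """Classical ceiling on an observed correlation: sqrt(r_xx * r_yy)."""
    return math.sqrt(reliability_test * reliability_criterion)


# Hypothetical reliabilities, chosen only for illustration
ceiling = max_observable_correlation(0.90, 0.80)
print(f"ceiling on the observed validity coefficient: {ceiling:.2f}")
```

With a test reliability of .90 and a criterion reliability of .80, for instance, no validity coefficient much above .85 could be observed even if the underlying relationship were perfect.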

Both these tests are interesting as embodying an important and 
very practical principle. This is the common-sense idea that a 
sample of a given function will provide excellent material for 
testing the probable efficiency of the function itself, and it is borne 
out by a number of investigations which indicate that initial per- 
formance on a learning task is closely related to final status. The 
high correlations with the criteria in each case are significant when 
this point of view is borne in mind. 

It is interesting to compare the last test discussed with a very 
different approach to the problem of Latin prognosis. Allen (q.v.), 
after an elaborate experimental and selective process, developed 
a battery of six tests intended to predict success in Latin. They 
were the Briggs Analogies Test Alpha and Beta, the Thorndike 
Test of Word Knowledge A and B, and the Rogers Interpolation 
Test 1 and 2. As a criterion he used a battery of eleven Latin 
achievement tests at the end of the first semester, and reported a 
multiple correlation between the battery and the criterion of .588. 
Clem (q.v.), in another study continuing this work, used the 
same criterion, and employed the Allen battery together with a 
number of additional factors, such as age, elementary school 
average, teacher ratings, high school marks in English and mathe- 
matics, etc. It was possible by using this large array of predictive 
indices in the best obtainable combination to obtain a correlation 
of .84 with the criterion. 
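
How a combination of predictors yields a multiple correlation can be sketched for the simplest case of two predictors, using the standard formula R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2). The Python example below is not a reconstruction of the Allen or Clem analyses, which used many more indices; all coefficients in it are hypothetical and the function name is illustrative.

```python
import math


def multiple_correlation_two_predictors(r_y1, r_y2, r_12):
    """Multiple R of a criterion on two predictors, from the three correlations."""
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical validities of .50 and .40 with varying overlap between predictors
for r_12 in (0.0, 0.3, 0.8):
    big_r = multiple_correlation_two_predictors(0.50, 0.40, r_12)
    print(f"predictor intercorrelation {r_12:.1f} -> multiple R = {big_r:.3f}")
```

In this illustration the gain from a second predictor shrinks as the two predictors overlap, and at an intercorrelation of .8 the combination does no better than the stronger predictor alone.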

‡ Reference: Symonds, 1927. 

The contrast is very striking. An elaborate array of indices, 
requiring much time and labor to amass, can yield at the best a 
validity barely superior to the short and simple Latin Prognosis 
Test against what is almost certainly a more reliable criterion. 
It is a powerful argument for the special purpose test, tied closely 
to the function to be measured, not only in Latin but elsewhere. 
Of course, neither the Latin Prognosis Test nor the two extensive 
batteries used in the studies just reported can be considered instru- 
ments for the measurement of aptitude in the strict sense of the 
term. That is, they do not correlate high with the criterion and 
low with most other factors, in particular general intelligence. One 
of the difficulties encountered by Allen and Clem was that their 
batteries predicted many other school subjects practically as well 
as and in some cases better than Latin. The same is probably true, 
though to a less extent, with the Latin Prognosis Test itself, 
although so far as the present writer knows, the point has not 
been investigated. Once again we seem to have a special purpose 
intelligence test (since achievement in Latin is pretty certainly 
“saturated” quite heavily with general intelligence), which may 
be considered an aptitude test if we are willing to extend the 
ordinary strict meaning of the term, which indeed seems legitimate 
enough. 


TALENT TESTS 


In many treatments of psychometrics, so-called talent tests are 
set off in a classification by themselves, separated from what are 
termed aptitude tests. The present writer definitely prefers to deal 
with them under the general heading of aptitude testing, for the 
reason already given that, in the present condition of the field, 
fine terminological distinctions are unreal and lead only to spuri- 
ous and wastefully confusing argument. The term talent is often 
used to mean a special innate capacity distinctively separated 
from other capacities and peculiar to itself. Thus a talent tends 
to be thought of as a sort of unitary or monolithic mental entity, 
and virtually as a faculty, although the word is scrupulously 
avoided as a usual thing. However, we know almost nothing about 
the psychological structure of special talent. Also, our ideas in 
regard to its hereditary determination are of the vaguest, for it 
is always quite possible that even the greatest genius owes much 
to his home environment as well as to his parents’ chromosomes. 
What is perfectly clear is that we are dealing once again with the 
aptitude problem under a slightly different name. Thus to drag 
in a terminological distinction which may correspond to reality 
but which may not, and in any case makes no real difference to 
actual procedure, seems gratuitous. For here as always, the psy- 
chometric problem is to set up and delimit an effective working 
concept that will yield a good instrument of measurement related 
to the contemplated criterion. 


1. Measures of Musical Talent (Seashore) * 


This celebrated battery of music talent tests has undergone 
several revisions and improvements since its first appearance. It 
now consists of 6 subtests. (1) A test of pitch discrimination, con- 
sisting of pairs of tones, the second of which differs in pitch from 
the first, beginning with readily discernible differences and pass- 
ing on to very fine ones. The task is to decide whether the second 
tone is lower or higher than the first. (2) A test of loudness dis- 
crimination, consisting of pairs of clicks, the second of which 
differs from the first in intensity, ranging again from readily dis- 
cernible to very fine differences. (3) A test of time discrimination, 
consisting of pairs of time intervals marked off by three clicks, 
the second differing from the first in duration. (4) A test of timbre 
or tone quality, consisting of pairs of tones, the second of which 
differs in timbre or quality from the first. (5) A test of rhythm 
discrimination, consisting of pairs of rhythmic patterns presented 
in clicks or taps, the second of which is either the same as or 
different from the first. The items increase in complexity, i.e., 
length, as the test progresses. (6) A test of tonal memory, con- 
sisting of pairs of tonal patterns intended to be devoid of melodic 
significance, the second differing from the first by the alteration 
of one of the tones. The task is to indicate the altered element 
by number. 

* References: Seashore, 1919; Saetveit, Lewis, and Seashore; Mursell; 
Farnesworth, 1931. 

The most important improvements which have been made in 
the battery are as follows. (a) The quality of the recordings by 
means of which the tests are presented has been made better as 
this became possible with improved technology. Electric record- 
ings have been substituted for the older mechanical recordings. 
(b) The consonance test which was formerly included, but which 
proved unsatisfactory, has been dropped, and the timbre test sub- 
stituted. The construction of this latter subtest has been made 
possible by the development of an ingenious sound source capable 
of modifying some of the partials of a compound clang without 
changing the others. (c) The battery has been reorganized. In the 
past each subtest required the playing of two record faces, which 
was time-consuming and involved the inclusion of many non- 
discriminating items. It now consists of two series, A and B, 
including all 6 subtests, each series requiring only one record face 
for each of the 6 subtests. This has been accomplished without 
loss of reliability. Series A is the easier of the two, and is intended 
for “dragnet” purposes for use with heterogeneous groups. Series 
B is recommended for use with specialists in music and for appro- 
priate laboratory purposes. 

Tables of percentile norms are presented for different ages. The 
test is not recommended for children less than 8 years of age. Sea- 
shore has repeatedly insisted that total scores on the battery can- 
not represent the true meaning of the performance it elicits. 
Profiles on the 6 subtests should always be used, and whatever 
judgments are made should be based on them. Unfortunately, as 
so often happens with correct psychometric procedure, this 
insistent recommendation has been widely overlooked by users of 
the tests. There is, of course, a question as to whether the subtests 
have sufficient reliability to yield meaningful profiles where differ- 
ences are slight. 

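The doubt about slight profile differences can be put in numerical terms with the usual standard error of a difference between two scores expressed on a common scale, SE_diff = SD * sqrt(2 - r_a - r_b). The Python sketch below applies it; the subtest reliabilities and standard deviation are hypothetical, since none are quoted here for the individual Seashore subtests, and the function names and the 1.96 multiplier are illustrative choices.

```python
import math


def standard_error_of_difference(sd, reliability_a, reliability_b):
    """SE of the difference between two scores sharing one scale and SD."""
    return sd * math.sqrt(2.0 - reliability_a - reliability_b)


def difference_is_dependable(score_a, score_b, sd, reliability_a, reliability_b, z=1.96):
    """True if the gap between two subtest scores exceeds z standard errors."""
    se_diff = standard_error_of_difference(sd, reliability_a, reliability_b)
    return abs(score_a - score_b) > z * se_diff


# Hypothetical case: standard scores with SD 10 and reliabilities of .70 and .75
print(round(standard_error_of_difference(10, 0.70, 0.75), 2))
print(difference_is_dependable(58, 50, 10, 0.70, 0.75))
print(difference_is_dependable(70, 50, 10, 0.70, 0.75))
```

On these assumed figures a gap of 8 points between two subtests would be well within chance variation, while a gap of 20 points would not.
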
Quite apart from practical utility, the battery is of great sys- 
tematic interest. It is the outstanding instance of an aptitude (or 
talent) test embodying consistently a very explicit psychological 
viewpoint. The usual procedures in test construction were not fol- 
lowed. Items were not selected on the basis of any direct empirical 
study of musical behavior, or in terms of their power to discrimi- 
nate levels of excellence in such behavior. In Seashore’s view, it is 
possible to make an essentially theoretical breakdown of musical 
talent into a large number of components, some of which can be 
translated into objective instruments of measurement, while many 
of them cannot. Attention should be called to his repeated in- 
sistence that the battery does not and cannot measure all or nearly 
all the significant components of musical talent. Disregard of this 
express and conscious limitation has led to numerous quite illegiti- 
mate criticisms. 

Nevertheless, the question of validation, as always, is para- 
mount. One may debate the theory, but one cannot avoid asking 
how and whether it actually works out. Are pitch discrimination, 
intensity discrimination, time discrimination, timbre discrimi- 
nation, rhythmic discrimination, and tonal memory, as embodied 
in the instrument, indeed significant and revealing components, 
at least in part, of musical talent? Various attempts have been 
made to validate the measures against general criteria such as 
teacher ratings on musicality, musical achievement, and the like. 
None of the obtained correlations are high, and most of them are 
low (Farnesworth, 1937; Mursell). Stanton has reported (q.v.) 
that the test predicts success in the Eastman School of Music 
reasonably well; but in her work she teamed it with the Iowa 
Comprehension Test, and she does not present her data in such a 
way that one can separate the predictive efficiency of the two in- 
struments. Possibly the intelligence test alone would work as well. 

Seashore, however, has protested against such over-all valida- 
tion against general criteria. In his view the pitch test, for instance, 
should be valid not for over-all musicality, but for certain highly 
specialized functions, such as a violinist’s ability to make very 
fine shadings and gradients of tone. Indeed each one of the sub- 
tests should be validated against a different pattern of fine and 
special musical functions. This, of course, would convert validation 
into an intricate laboratory problem, to which in itself no reason- 
able exception can be taken. But it means first that the instrument 
has not in fact been validated, and second that its use for general 
classification and selection, e.g., for securing members of a high 
school orchestra, becomes very doubtful. 

In summary, it seems correct to say that the test deals funda- 
mentally with various functions of auditory acuity and discrimi- 
nation. Whether such functions are true components of musical 
talent, which presumably turns on higher mental integrations, may 
well be questioned. If they are not, then the battery still has a 
negative value, analogous to that of an oculist’s color chart in 
relation to artistic talent. It would successfully reveal those who 
do not hear well enough to function musically with success. But it 
would not reveal positive components of musical talent. 


2. Interval Discrimination Test * 


This is another music talent test constructed in terms of an 
hypothetical premise. The presupposition is that the ability to 
make fine discriminations of intervallic quality (e.g., between 
thirds and sixths sounded simultaneously as chords), is an index of 
capacity for musical behavior. 

* Reference: Madison. 

The test is a series of items each consisting of a short set of 
intervals sounded as bi-chords (e.g., C-E, C-G, etc.). In each item, 
one of the intervals sounded is different from the rest included, the 
task being to identify it. The discriminations run from very easy 
to very difficult. It is too early to offer a comprehensive report, as 
the test is still in an experimental stage. However, reported re- 
liabilities are high, and some very high correlations with over-all 
criteria of musicality have been found. The essential difference 
between this test and the Seashore Measures is that it consists of 
actual musical material (i.e., sounds that actually occur in music), 
whereas Seashore explicitly avoided such content, and that it re- 
quires what is probably a higher level of perceptual discrimina- 
tions. 


3. Musical Memory Test (Drake)* 


The test consists of a series of pairs of melodic items, to be 
played on the piano or other suitable instrument. The second mem- 
ber of each pair is different from the first either in key, or time, or 
notes. The task is to indicate in which of the three respects the 
difference lies. A reliability of .93 has been reported. Percentile 
norms for ages 7 to 23 are given in the instruction manual. 

Drake does not consider this to be a test of musical achievement, 
but of musical talent, of which he believes memory or melodic re- 
tentiveness to be an important indication. There are considerable 
general arguments in favor of his position. Memory items have 
been widely used in tests of intelligence. Feats of memorization 
and retention recur in the biographies of great musicians. And 
Kate Gordon (q.v.) has shown, although with too few subjects to 
make her conclusions entirely convincing, that there are enor- 
mous differences in the memory performance of those who use 
“musical” and “unmusical” methods. Drake has been able to 
report fairly high correlations between scores on his test and 
global criteria of musicality. 


4. Tests in Fundamental Abilities of Visual Arts (Lewerenz) † 


The test consists of three parts, each Containing various sub- 
tests. Part I. Preferences for proportion as shown in different de- 
Signs, the task being to choose between 4 Variations of one theme ; 
Originality of line, the task being to draw lines between dots 
Printed on the test blank so as to make a Picture, 10 items in all. 


* References: Drake, 1933 a, b. 
† References: Lewerenz; Milton H. Bird. 
Part II. Indication of omissions of shadows in 10 drawings; a 50- 
item vocabulary test dealing with art materials and processes, 
drawing terms, and elements in pictures; an immediate memory 
test, the task being to reproduce part of a picture of a vase from 
memory. Part III. Three tests calling for an indication of errors 
in pictures showing cylindrical, parallel, and angular perspective; 
a color matching test, with 6 key colors to be matched with 46 
variations in hue and shade. 

The test was constructed in connection with work in art educa- 
tion in the schools of Los Angeles and has been checked for validity 
against classroom data there. It correlates .40 with marks in art 
classes. A reliability of .87 has been reported for 100 pupils in 
grades 3 to 9. Decidedly the most interesting and seemingly perti- 
nent of the subtests is that requiring the subject to draw lines 
making a picture between patterns of dots. This calls for imagina- 
tion and initiative and is often thought to have some projective 
significance. Next to this in excellence is probably the first subtest, 
calling for judgments of proportion. The others are of doubtful 
relevance to art talent. 


5. McAdory Art Test * 


The test consists of 72 plates, each with 4 variations of 1 picture. 
The plates are not large enough for use with a group of consider- 
able size. The materials are drawn from current art and trade mag- 
azines and include objects of common use, costume items, textile 
designs, pictures, etc. In the 4 variants there are changes from the 
original in proportion, intensity, and color. The subject records his 
preference in each of the 72 cases. In the scoring, 1 point credit is 
given for each agreement with the key, which represents the judg- 
ments of a jury of 100 experts, including artists, architects, art 
teachers, and critics. The items included in the test were those on 
which there was agreement by at least 64% of the judges. A total 
score for the whole test can be computed. Also there are totals for 
each of its 6 divisions (furniture and utensils, texture and cloth- 
ing, architecture and related arts, shape and line arrangements, 
massing of light and dark, color schemes). 
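
The scoring rule just described can be put in a short sketch. It assumes that the subject's choices, the jury key, and the assignment of plates to divisions are held in simple lookup tables; the function name, the table layout, and the sample plates are hypothetical and do not come from the test manual.

```python
from collections import Counter


def score_mcadory(responses, key, division_of):
    """Return (total, per-division totals) for one subject.

    responses   -- plate number -> variant chosen by the subject
    key         -- plate number -> variant preferred by the jury
    division_of -- plate number -> name of the division the plate belongs to
    """
    per_division = Counter()
    for plate, keyed_choice in key.items():
        # One point of credit for each agreement with the key.
        if responses.get(plate) == keyed_choice:
            per_division[division_of[plate]] += 1
    return sum(per_division.values()), dict(per_division)


# Three made-up plates, for illustration only.
key = {1: "A", 2: "C", 3: "B"}
division_of = {1: "color schemes", 2: "color schemes", 3: "furniture and utensils"}
responses = {1: "A", 2: "D", 3: "B"}
total, by_division = score_mcadory(responses, key, division_of)
```

The total is simply the sum of the division totals, so the two kinds of score can never disagree.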

The test is competently constructed and standardized. Grade 
norms are made available. Its validity, however, is dubious. Much 
of the pictorial material is now outdated and has a queer appear- 
ance, notably the costume pictures. 

* References: McAdory; Milton H. Bird. 
6. Meier Art Test. I. Art Judgment * 


The test consists of 100 paired comparisons of pictures, the 
pictures of each pair differing in one respect, e.g., position of the 
moon, size of topsail, etc. The difference is specified, and the sub- 
ject is required to decide which member of the pair is better, i.e., 
more pleasing, more artistic, more satisfying. 

It is a revision of the Meier-Seashore Art Judgment Test, which 
had 125 items of similar kind, reduced from an original 300. 
Criteria for item selection in the Meier-Seashore Test were: (a) 
reputability of the art work, (b) exemplification of some aesthetic 
principle, (c) suitability for testing. Each item was submitted to 
25 art experts, and the resulting experimental form of the test 
was given to 1081 individuals. Final selection of items was made 
on (a) favorable reaction of the experts, (b) 60 to 90% preference 
by the subjects. The present test was derived from the previous 
one by an analysis of the prognostic value and relative consistency 
of the items, using biserial correlation. Of the previous 125 items, 
the 25 worst were eliminated, and the 25 best given an additional 
point in a new weighted scoring system. In comparison with the 
previous test, a wider distribution of scores was obtained, which 
“suggested an enhanced validity for the new form of the test” 
(Manual, p. 14). There is also cited “additional evidence of 
validity,” among which the following points are important. Corre- 
lations with intelligence are negligible, running from —.14 to .28. 
Adults of superior intelligence do not rate as high as members of 
art faculties. Art experts obtain high scores. The fact that a junior 
high school pupil (age 12) without art training may score as well 
as a trained adult is thought to suggest that the test measures 
innate ability. Reliabilities of .70 to .84 are reported, without 
specifying the type of coefficient. The test is designed to indicate 
probable art talent in the general population. Percentile norms are 
presented, based on more than 9000 junior and senior high school 
pupils studying art. The following interpretations are suggested. 
First quartile (100-76): other things being equal, almost certain 
to succeed in an art career, especially if the individual has craft 
skill in his ancestry, good intelligence, and initiative and energy. 
Second quartile (75-51): high average art judgment, which, if 
it is associated with other favorable traits, makes it possible to 
expect much. Third quartile (50-26): still more compensating 


* References: Meier, 1926, 1942. 
factors would be needed if success were to be anticipated. Fourth 
quartile (25-1): should take other tests, and inquire further. If 
the rating were corroborated, advice would be against attempting 
an art career. 
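
The suggested interpretations amount to a lookup from percentile standing to one of four bands. The sketch below assumes that the second band runs from 75 to 51, so that the bands are contiguous; the function name and the wording of the labels are illustrative only.

```python
def meier_band(percentile):
    """Map a percentile standing (1-100) to its interpretive band."""
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must lie between 1 and 100")
    if percentile >= 76:
        return "first quartile"
    if percentile >= 51:  # assumed lower bound of the second band
        return "second quartile"
    if percentile >= 26:
        return "third quartile"
    return "fourth quartile"
```

A standing of 80, for example, falls in the first band, and a standing of 25 in the fourth.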

As a footnote to the six tests so far discussed, it may be re- 
marked that the word talent tends quite strongly to be confined to 
artistic, musical, and perhaps literary pursuits, although the limi- 
tation is not absolute. Is there any reason for speaking of mechani- 
cal “aptitude” and of artistic “talent,” and for confining the term 
to these connections? Obviously not. Clearly we are dealing with 
nothing deeper than common usage, for there is no evidence for 
supposing or reason to believe that “aptitudes” and “talents” are 
in any way really different psychological functions. 


7. Stanford Scientific Aptitude Test (Zyve)* 


Zyve has undertaken to analyze “scientific aptitude” into the 
following components. (1) The ability to make and recognize clear 
definitions. (2) The tendency to suspend judgment when evidence 
is insufficient, as contrasted to making snap judgments. (3) An 
experimental bent. (4) Power of discriminating values in the selec- 
tion and arrangement of data. (5) Power to detect fallacies and 
contradictions. (6) Reasoning. (7) The accumulation of systematic 
observations. (8) Induction, deduction, generalization. (9) Accu- 
racy of understanding and interpretation. (10) Caution. These 
functions constitute the 10 subtests. Correlations of .95, .77, and 
.89 are reported with “competent judgments on the abilities of 
50 research students in science.” No other validation is offered, 
and this material is not adequately analyzed. Crawford (q.v.) has 
reported correlations of only .30 between the test scores and sub- 
sequent marks in science for 143 Yale freshmen, and a reliability 
of only .60 for the whole test. Benton and Perry (q.v.) report 
correlations of .30 to .37 between scores on the Zyve test and 
grades in science over four college years for 43 students. Marshall 
(q.v.) reports correlations with grades in science courses in 
college as follows: with freshman-sophomore science grades, 
.404 ± .09 (N = 47); with junior and senior science grades, 
.345 ± .09 (N = 43); with average chemistry grades, .369 ± .085 
(N = 47); with average physics grades, .423 ± .08 (N = 46); 
with average biology grades, .523 ± .07 (N = 46). It is extremely 
probable that the data reported in the last three studies represent 

* References: Zyve; Bingham; Crawford, 1941 b; Benton and Perry. 
the predictive value of the test more accurately than Zyve’s re- 
ported correlations with instructor ratings on scientific ability. 
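
The source does not state how the margins attached to Marshall's correlations were obtained. If they are taken to be probable errors of the classical form, 0.6745(1 - r^2)/sqrt(N), which is an assumption and not something the text confirms, they can be checked as follows; the function name is hypothetical.

```python
import math


def probable_error(r, n):
    """Classical probable error of a correlation coefficient r based on n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.6745 * (1.0 - r ** 2) / math.sqrt(n)


# Chemistry grades as reported: r = .369 with N = 47.
pe_chemistry = probable_error(0.369, 47)
```

With r = .369 and N = 47 this gives about .085, which agrees with the figure printed for the chemistry grades.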


CONCLUSIONS 


There are of course an enormous number of instruments under 
the general classification of aptitude tests as here understood. But 
the samples that have been discussed are representative of the 
methods used in construction, the derivation of interpretive norms, 
and validation; and it may be said with some confidence that 
there are none decisively better. Thus the present writer believes 
that a not seriously misleading picture of what has been accom- 
plished is presented in spite of the necessary narrowness of the 
selection. 

It would seem that those tests are most superior in the essential 
characteristic of validity in which the central concept is most 
directly derived from the function to be measured. Such are in 
the first instance the Latin Prognosis Test and the Algebra Prog- 
nosis Test, with the mechanical assembly tests and motility or dex- 
terity tests displaying a less certain relationship to their criteria. 
But in each of such cases, the frame of reference and the basic 
concept are rigidly limited. When it comes to instruments like the 
Medical Aptitude Test or the Iowa Placement Examination, which 
are essentially general mental tests with a special slant and pur- 
pose, there seems to be a superior validity for that particular pur- 
pose to the general intelligence test. When it comes to tests con- 
structed about some broad psychological analysis of the function, 
like the Seashore or the Zyve, they are interesting and noteworthy 
in direct proportion to the debatability and seriousness of the 
general position itself. But if they possess practicable validity, this 
has not been satisfactorily shown. It may be doubted very much 
whether effective aptitude tests can be constructed in terms of such 
a logic, partly because psychological science has not advanced far 
enough, and partly because there is great psychological overlapping 
in functional operations. To enjoy or produce music, to paint pic- 
tures, to learn mathematics, to handle groups of human beings, 
and so forth, are activities which, in all probability, have a great 
deal in common, psychologically speaking. Of course this com- 
monality may be supposed to be greater or less in specific cases. 
But we never know quite how great it is, and all boundary lines 
are hazy. Thus psychological theory and general analysis are 
probably incompetent to define and specify working concepts 
about which valid tests can be built. The best instruments derive 
their validity from concepts empirically defined in terms of the 
job or function itself. 

Thus it has to be recognized that the whole immense amount of 
work that has been done in developing tests of what are variously 
called aptitudes, special abilities, or talents has contributed 
nothing of signal importance towards the establishment of authen- 
tic psychodiagnostic instruments. Profiles of one sort or another 
can certainly be secured by administering batteries of tests of 
varied kinds, or instruments with a variety of subtests such as the 
Detroit General Aptitudes Examination. But such profiles are no 
more than summations of differential scoring on the tests them- 
selves, and cannot be taken as representing a valid picture of men- 
tal organization itself. The reason is that the most successful of 
such tests are built about concepts defined in terms of the func- 
tions to be measured, the nature itself of which is psychologically 
undetermined. Because of their very nature and logic, the measur- 
ing instruments cannot penetrate below this level. And if we turn 
to tests projected on broad psychological considerations, the 
trouble then is that their relation to any recognizable practical 
function is in doubt. 


SUGGESTED ADDITIONAL READINGS 


For additional reading and more intensive study of the material in 
this chapter the most important sources are the tests and more par- 
ticularly the manuals of the tests discussed. Publishers will be found 
listed in the bibliography of tests at the end of the book. Also the 
references mentioned in the text in connection with the various tests 
may be consulted. Further readings and suggestions are as follows: 

Walter V. Bingham, Aptitudes and aptitude testing (New York: 
Harper and Brothers, 1937), Chapters 2 and 3, “The theory of apti- 
tude”; Chapter 16, “Selection of tests.” A broad treatment of the 
theory of aptitude in the first reference, and of numerous general 
problems of aptitude testing in the second. 

Clark L. Hull, Aptitude testing (Yonkers-on-Hudson, N. Y.: World 
Book Co., 1928), Chapter 6, “The basic theory of aptitudes and 
tests.” Discusses a number of basic theoretical issues. 

Frank N. Freeman, Mental tests: their history, principles, and 
applications (Rev. ed.; Boston: Houghton Mifflin Company, 1939), 
Chapter 7, “Tests for the analysis of mental capacity.” Aptitude-type 
tests discussed from an interesting interpretive viewpoint. 

Donald G. Paterson, Richard M. Elliott, L. Dewey Anderson, Edna 
Heidbreder, Minnesota Mechanical Ability Tests (Minneapolis: Uni- 
versity of Minnesota Press, 1930), Chapter 16, “Summary and sig- 
nificance of the results.” A summary and interpretation of their 
elaborate studies on mechanical aptitude tests. 

Oscar Krisen Buros (editor), The 1938 mental measurements year- 
book (New Brunswick, N. J.: Rutgers University Press, 1938); also 
The 1940 mental measurements yearbook (Highland Park, N. J.: The 
Mental Measurements Yearbook, 1941). These valuable reference 
works should be consulted for test reviews and bibliographies. 

Gertrude H. Hildreth, A bibliography of mental tests and rating 
scales (2nd ed.; New York: Psychological Corporation, 1939); also 
Bibliography of mental tests and rating scales. 1945 supplement 
(New York: Psychological Corporation, 1945). These very complete 
bibliographies are most useful sources for any additional tests that 
may be desired. 


QUESTIONS FOR DISCUSSION 


1. What would be some of the practical advantages if a measurable 
function or process of general dexterity or motility had been dis- 
covered? 

2. What subtests of the Detroit General Aptitudes Examination 
resemble subtests in intelligence tests that have been discussed? Par- 
ticularize. 

3. Of the tests discussed in this chapter, which are aptitude tests 
according to the definitions of aptitude which have been quoted? 
Give reasons. 

4. To what extent might the Latin Prognosis Test and the Algebra 
Prognosis Test measure general intelligence? Why? 

5. Can you make any suggestions for material for tests similar to 
the two just mentioned for chemistry, English composition, manual 
training, typewriting? Would such prognosis tests tend to measure 
general intelligence? 

6. Could we say that the Seashore Measures are based on a psy- 
chological theory of musicality, while the Interval test is based on a 
partial analysis of the musician’s job? 

7. Would the orientation of general intelligence tests towards 
school work make them in effect “special purpose tests” analogous to 
those discussed? Where would the differences, if any, lie? 

8. Would an individual profile based on a battery of intelligence 
tests and aptitude tests be anything more or reveal anything more 
than a pattern of “global scores”? Would the criticism of “global 
scores” apply to it? 

9. Might there be any difference in psychological significance be- 
tween such a profile and that yielded by the Chicago Tests of Primary 
Mental Abilities? Just what, if any, would the difference be? 


CHAPTER VIII 


TESTS OF PERSONALITY, INTEREST, ATTITUDE, 
AND CHARACTER 


THE AREA: ITS DELIMITATION AND CHARACTERISTICS 


A great many instruments for measurement and appraisal be- 
long in the broad area roughly indicated by the title of this 
chapter. They range all the way from techniques for the observa- 
tion of subjects in a normal life setting to psychometric tests of 
more or less conventional form and construction. Projective instru- 
ments also have been treated in connection with it, but since their 
whole theory and approach is distinctive, they are dealt with else- 
where in this book. Three comments need to be made in advance in 
regard to the characteristics of the area under consideration before 
passing on to consider typical instruments and techniques. 

1. The terminology is vague. Indeed, rigorous attempts to estab- 
lish precise and sharply defined distinctions probably create diffi- 
culties rather than removing them. Thus the reader should be 
warned in advance that words here cannot be said to have uni- 
versally accepted meanings, and that they are often not used in 
exactly the same sense by all writers, or even by the same writer 
in different places. Thus E. B. Greene (q.v.), who is unusually 
careful to define his terms, speaks of the whole field as having to 
do with what he calls “modes of adjustment,” by which he means 
“ways in which a person approaches a goal.” This certainly covers 
a sufficiently wide extent of territory! But it is not inappropriate. 
Again, the word trait is frequently employed, and it is roughly de- 
fined as a fairly consistent and specific mode of behavior, which 
once again would make boundary lines none too easy to draw. 
Temperament, once again, is taken to mean a group of more or 
less similar traits, or a pervasive and inclusive trend of the per- 
sonality. It is possible to find at least two different connotations 
of the term attitude. An attitude may be thought of and dealt with 
as a tendency to react and feel in a certain way about some specific 
issue or problem, such as free speech, communism, or war. Or it 
may be used to stand for a generalized tendency to approach life 
intellectually, aesthetically, or in terms of religion, in which case 
it may be equivalent to the word value. So one might go on. The 
point is not to be confused by mere terminology, or to strive fruit- 
lessly for clean-cut, mutually exclusive classifications, or to waste 
time and energy in seeking sharp definitions and distinctions which 
the phenomena themselves render impossible, at least in our pres- 
ent stage of understanding. 

The investigations of Raymond Cattell (q.v.) constitute what 
is undoubtedly the most elaborate and far-reaching attempt now 
being made to explore and clarify the ideas and concepts of this 
field. Insisting on the importance of distinguishing between trait 
modalities (1946 a), he presents operational definitions according 
to which dynamic traits are those which respond to changes in 
incentives, abilities respond to alterations in the complexity of 
the path to a goal, and temperament is of all trait-types the least 
responsive to field changes. Cattell’s general position is that many 
so-called mental abilities are resultants of, or at least very closely 
associated with, personality factors of an apparently different 
kind. He suggests that verbal ability, for instance, may be associ- 
ated with lack of sociability, which results in preference for 
books over people, and that mathematical ability is associated with 
lack of dominance. By means of an elaborate factorial analysis, 
he arrives at a list of 12 basic or primary personality character- 
istics or traits. It must be confessed that at the present time this 
work remains remote from the practical uses of psychometrics, 
and that for the most part it is no more than a promise, or perhaps 
a hope, of clearer working concepts to come. However, Cattell 
demonstrates that a personality trait recognizable in adult life 
can express itself in a miniature laboratory situation, also in 
recognizable form, which provides an ultimate working basis for 
test construction (1941, 1944). 

2. This vagueness should not blind us to the great importance 
of the field, or to the great value of even reasonably good instru- 
ments of measurement and evaluation for dealing with it, if such 
can be found. As a matter of fact, good instruments of this type do 
exist, although unfortunately there are also a great many bad ones 
as well. So far as the importance of the area is concerned, this is 
a matter of common-sense observation, and it has been confirmed 
by a number of investigations. 

Thus Farmer (q.v.) studied the reactions of 259 boys on a cube 
construction test, and was able to differentiate several types. (a) 
First, there was the completely controlled type, showing complete 
mastery of the situation, not thinking ahead, and not worrying. 
(b) Second, there were the “thinkers,” who tended to think ahead 
in abstract terms and symbols including language and to plan 
before they acted. (c) Third, there were the “good workers,” con- 
scientious, plodding, going ahead even when baffled. (d) Fourth, 
there were the “fools,” inept, never knowing what to do next. (e) 
The fifth was a miscellaneous category, containing those with no 
very definite characteristics in action. After three years of train- 
ing, the industrial proficiency of these boys was measured by a 
practical examination, and a decided difference in favor of the 
first three types as contrasted with the last two was revealed. 
This, of course, is not a conclusive study, but it is suggestive. If 
instruments for the appraisal of personality and temperament 
can be shown to have prognostic efficiency after the lapse of three 
years, it is clear first that the tests themselves are useful, and 
second that the factors revealed are of major importance. 

Once again the Pannenborgs (q.v.), in a notable study, found a 
remarkable consistency in traits and general temperament among 
persons who are musically talented and musically active. They 
made a study of 3,860 children of whom 494 were known to be 
musical, 423 professional adult musicians, and the biographies of 
21 eminent composers, and report a striking trait similarity among 
all three groups. The groups of musically talented and musically 
active persons are decidedly above the average in physical activ- 
ity, not very industrious, highly emotional in their reactions to the 
circumstances of life, intellectually versatile with high literary and 
artistic interests, imaginative sometimes to a pathological degree, 
not orderly, punctual, or “scientific,” endowed with strong vital 
needs, fond of eating and sexual expression, interested in the oppo- 
site sex, physically healthy, but often nervously unstable. The 
general purport of this work is that what is ordinarily called musi- 
cal talent is by no means an isolated and special quasi-faculty, but 
a general and pervasive setting of the entire personality. 

But perhaps no research investigations are really needed to 
demonstrate the paramount importance of the area here under 
consideration. 

3. As typical instances of the psychometric work in this area 
come up for discussion, it will become evident once again, and 
perhaps with special force and clarity, that everything depends on 
the isolation of the proper working concept if the resulting instru- 
ment is to have any value. To isolate and define a concept that will 
lead to a clearer understanding and a better control of mentality 
or behavior (whichever term one prefers), and then to translate it 
into viable items, is the logical pattern of test construction here as 
elsewhere. But in some respects the process is thrown into sharper 
relief, and its essential nature more manifestly revealed in con- 
nection with the measurement and appraisal of personality, inter- 
est, attitude, and character than in any other connection. 


MEASURES OF PERSONALITY 


Here workers in psychometrics have found themselves coping 
with a most inclusive concept. The definition put forward by All- 
port (1927) is probably as good as any. Personality is said to 
mean “the individual’s characteristic reactions to social situations, 
and his adaptation to the social features of his environment.” All- 
port has been able to find within the meaning so indicated certain 
psychometric leads. He considers the prime factors in personality 
to be first, intelligence or general adaptability; second, motility or 
speed of reaction; third, temperament or the individual’s prevail- 
ing emotional reactions or moods; fourth, sociality or tendency to 
social participation; fifth, the individual’s manner of solving social 
problems. All these obviously overlap. Measures which are con- 
sidered to have to do with personality emphasize the last three. 

The instruments to be considered here, and which are represent- 
ative of a very large number of similar examples, may be classi- 
fied and interpreted in terms of their use of one of three prevail- 
ing modes of technical treatment and approach. (a) The first of 
these is the use of self-rating items centering about concepts em- 
pirically isolated and defined. (b) The second is the use of self- 
rating or self-revealing items centering about concepts defined by 
systematic psychiatry. (c) The third mode of technical approach 
is by ratings made by someone other than the subject himself. Of 
the tests to be discussed below, numbers 1, 2, and 3 belong to the 
first category, numbers 4, 5, and 6 to the second, numbers 7 and 8 
to the third, while number 9 is of a special type. 


1. The Adjustment Inventory (Bell)* 


This is a self-questionnaire. It consists of 160 questions, such as 
the following: Do you daydream frequently? Did you ever have 
a strong desire to run away from home? Do you take cold rather 

* Reference: Bell. 
easily from other people? Do you enjoy social gatherings? Are you 
often sorry for yourself? According to the instructions, these ques- 
tions are to be answered “frankly and honestly.” 

The scoring of the answers is intended to reveal the subject's 
status with reference to 5 aspects of adjustment. (1) Home adjust- 
ment, i.e., whether he is satisfied with his home life and associa- 
tions. (2) Health adjustment, i.e., whether he has been ill much, 
has had operations, suffers from minor ailments. (3) Social adjust- 
ment, i.e., whether he is shy, retiring, submissive. (4) Emotional 
adjustment, i.e., whether he is easily disturbed, nervous, depressed. 
(5) Occupational adjustment, i.e., whether he is satisfied with his 
job, its associations, conditions, etc. The endeavor to insure that 
this would be a valid instrument turned on item selection. Items 
were chosen on clinical experience, on their correlation with similar 
items in other such inventories, and on their power to discriminate 
between well-adjusted and ill-adjusted persons. A reliability of .94 
is reported for the whole instrument. 

The inventory is internally consistent, but it is clear that the 
categories of adjustment, so called, are entirely empirical. Also it 
is doubtful whether self-questioning of this direct and obvious 
kind can yield authentic insights, and of course there is the ques- 
tion whether the five types of adjustment are really distinct, fun- 
damental, and meaningful. Moreover it is also open to much doubt 
whether the questions can be answered yes or no, as the inventory 
requires. It is, however, competently set up, and within the very 
grave limitations indicated, probably about as good as such instru- 
ments, of which there are many, can very well be. 


2. California Test of Personality * 


This is another fairly typical self-questionnaire. It calls for yes- 
or-no responses. The total score is divided into two major parts: 
first, self-adjustment, which includes self-reliance, sense of per- 
sonal worth, sense of personal freedom, feeling of belonging, 
freedom from withdrawing tendencies, and freedom from nervous 
symptoms; and social adjustment, which includes social stand- 
ards, social skills, freedom from antisocial tendencies, family rela- 
tions, school relations, and community relations. The instrument 
yields a total score, a self-adjustment score, and a social adjust- 
ment score, and also a profile indicating status on the various ad- 


* References: Tiegs, Clark, and Thorpe; R. Cattell, 1941 a; Symonds, 1941; 
Vernon, 1941. 
justment factors. Altho 


the construction and revision of the test, they failed to yield 
usable and fruitful out 


based on what the authors speak of as “logical analysis,” 
as experience, the jud 


statistical research. In framing the items 


The concept of adj 
the whole instrument stands Or f. 


3. Personality Quotient Test (Link)* 
This is yet another Self-questionnair 
ntains such 
In fact, the author of the instrument (Link) attacks the require- 
ment of validation, which he asserts is of secondary importance. 
The reason for this is that the instrument turns upon his concep- 
tion of personality as “the possession of habits which interest and 
serve other people.” Link feels that if the test is logically con- 
structed to embody this concept, its external validation does not 
matter. 

In terms of this concept he proposes to interpret obtained scores 
by converting them into “personality quotients” on the model of 
the I.Q. This so-called “personality quotient” is obtained by find- 
ing the difference between the subject’s obtained score and the 
mean score for his age group, dividing the result by the standard 
deviation of the group, and multiplying the result by 17. The 
constant 17 is chosen because it is the approximate standard devia- 
tion of intelligence quotients as obtained by the Stanford-Binet 
Scale. Needless to say, this is a piece of pseudo science completely 
irrelevant to the matter in hand. In fact the whole set of statistical 
operations, and the resultant “quotient,” are completely far- 
cical. 
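
The arithmetic described above can be sketched as follows. The sketch follows the wording of the text literally; whether a base value such as 100 is then added, on the model of the I.Q., is not stated here and is therefore left out. The function and parameter names are hypothetical.

```python
def personality_quotient(score, group_mean, group_sd, constant=17.0):
    """Deviation of a score from the age-group mean, in SD units, times 17."""
    if group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (score - group_mean) / group_sd * constant


# A score one standard deviation above the mean of its age group.
example = personality_quotient(score=60, group_mean=50, group_sd=10)
```

Written out in this way, the procedure is seen to be an ordinary standard score multiplied by a constant, and no quotient is formed at any point.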

This instrument has been presented to give the reader some idea 
of how bad a thoroughly bad test in this area can be. Obviously 
dubious answers to impudent questions are interpreted in terms of 
a scheme of concepts which have no assignable meaning or diag- 
nostic significance. And then, presumably to give them a show of 
authenticity with the public, they are distorted by irrelevant 
statistical treatment into a score which has a sound analogous to 
the most widely known of all psychometric measures, the very 
name of which falsifies the method employed to calculate it. Tests 
of personality turning upon empirical concepts are always open 
to the suspicion of superficiality. But at least they can be con- 
structed in such a way that whatever meaning they may have is 
reported in the scores in an honest and straightforward fashion. We 
turn now to two personality tests whose basic concepts are drawn 
from psychiatry rather than from empirical analysis and conjec- 
ture. 


4. Personality Inventory (Bernreuter)* 


This is among the best known instruments in the field of per- 
sonality measurement. It consists of 125 items of the self-rating or 
self-questionnaire type. The items were chosen in part from pre- 


* References: Bernreuter; Flanagan, 1935. 
vious personality tests, the basis of choice being their power to 
differentiate between persons rated high and low on the contem- 
plated criteria, which will be explained below. Sample items are 
the following: Does it make you uncomfortable to be unconven- 
tional or “different”? Do you often feel miserable? Do you try to 
get your way even if you have to fight for it? 

The basic technical novelty is to treat each item-response as
indicative of several different traits. In the original inventory 4
such traits were set up, but 2 more have been added due to the
work of Flanagan so that there are now 6. They are as follows.
(1) B1-N, Neurotic Tendency, i.e., the trait of emotional instability.
(2) B2-S, Self-Sufficiency, i.e., the trait of rarely asking for
sympathy, ignoring advice, liking to be alone. (3) B3-I,
Introversion-Extroversion, i.e., the trait of being imaginative, living in
oneself. (4) B4-D, Dominance-Submission, i.e., the trait of dominating
others in face-to-face relationships. (5) F1-C, Self-Confidence,
i.e., when high the trait of hampering self-consciousness,
when low the trait of wholesome self-confidence. (6) F2-S, Sociability,
i.e., the trait of being nonsocial, indifferent, which is indicated
by a high score here.

The operation of the scoring scheme may be gathered from
Figure 25. The scores of the three possible responses (yes, no,
doubtful) to one item are presented. The score values run from
+5 to -5. As will be seen, a positive answer to the question “Do
you like to bear responsibilities alone?” carries a score of +4 on
Self-Confidence, and of -1 on Neurotic Tendency. A negative
answer carries a score of -5 on Self-Confidence, and +2 on
Neurotic Tendency, and so on. Similar sixfold values are assigned
to all three possible answers to the 125 questions which make up
the inventory.


Do you like to bear responsibilities alone?

            B1-N       B2-S         B3-I           B4-D         F1-C        F2-S
            Neurotic   Self-        Introversion-  Dominance-   Self-       Socia-
            Tendency   Sufficiency  Extroversion   Submission   Confidence  bility
Yes           -1          4            -1             3            4           4
No             2         -4             2            -3           -5          [?]
Doubtful      -2          1            -1             2           -2          -2


FIG. 25. SCORING OF ONE ITEM ON BERNREUTER INVENTORY ON SIX TRAITS

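
The multi-trait scoring described above can be sketched in a few lines of Python. This is an illustration only, not the published scoring procedure: it uses just the four weights stated in the running text for the one sample item (the other traits are omitted), and the function and variable names are hypothetical.

```python
# Weights for the single sample item, restricted to the values stated in the
# running text (Neurotic Tendency and Self-Confidence). Names are hypothetical.
ITEM_WEIGHTS = {
    "Do you like to bear responsibilities alone?": {
        "yes": {"B1-N": -1, "F1-C": 4},
        "no": {"B1-N": 2, "F1-C": -5},
    },
}


def score_inventory(responses, item_weights):
    """Add up, trait by trait, the weights attached to each chosen answer."""
    totals = {}
    for item, answer in responses.items():
        for trait, weight in item_weights[item][answer].items():
            totals[trait] = totals.get(trait, 0) + weight
    return totals


example = score_inventory(
    {"Do you like to bear responsibilities alone?": "no"}, ITEM_WEIGHTS
)
print(example)  # {'B1-N': 2, 'F1-C': -5}
```

With the full inventory the same loop would simply run over all 125 items and all six traits, each answer contributing to every trait at once.
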
The addition of the two “F” traits came about in the development
of the inventory. Bernreuter found that neurotic tendency
and introversion-extroversion correlated about .95, so that they
were virtually the same. Thus only 3 of the original 4 “B” ratings
were needed. Flanagan, by means of factor analysis, drew the
conclusion that two traits, which he named self-confidence and
sociability and characterized as above, were the chief components
of the first 4. So only the two “F” traits are necessary, although all
6 are retained.

Two simplified scoring plans for the inventory have been published.
The first of these (Kempfer) rates all values from -3 to
+3 as 0; 4 and above becomes 1, and -4 and below becomes -1. It
is not suitable for accurate work, but useful in the rapid location
of extreme cases. The other plan (McClelland) counts all answers
weighted +3 or more, and subtracts all weighted -3 or less for
each trait scale. The resulting scores are reported as correlating
with the full scale from .95 to .84 for various traits. Both plans
greatly reduce time and labor.
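
The two simplified plans may be sketched as follows. This is a rough illustration under stated assumptions: the Kempfer band is taken to run from -3 to +3, the McClelland plan is read as a count of strongly positive answers minus a count of strongly negative ones, and the sample weights are invented for the example.

```python
def kempfer_value(weight):
    """Collapse a full weight to -1, 0, or +1 (band assumed to be -3..+3)."""
    if weight >= 4:
        return 1
    if weight <= -4:
        return -1
    return 0


def kempfer_score(weights):
    """Trait score under the Kempfer plan: sum of the collapsed values."""
    return sum(kempfer_value(w) for w in weights)


def mcclelland_score(weights):
    """Answers weighted +3 or more, minus answers weighted -3 or less."""
    plus = sum(1 for w in weights if w >= 3)
    minus = sum(1 for w in weights if w <= -3)
    return plus - minus


# Invented weights of one subject's answers on a single trait scale.
weights = [5, -1, 3, -4, 2, 4]
print(sum(weights))               # 9 (full weighted score)
print(kempfer_score(weights))     # 1
print(mcclelland_score(weights))  # 2
```

Both shortcuts discard most of the graded information in the weights, which is consistent with the remark that they suit rapid screening rather than accurate work.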

Lorge (1935 a, b) has made a highly critical analysis of the 
Inventory. (a) He finds reliabilities of .88 for scores on Neurotic 
Tendency, .80 for scores on Self-Sufficiency, .87 for scores on 
Introversion-Extroversion, .85 for scores on Dominance-Submission.
These are not sufficient for individual diagnosis. Furthermore,
he finds that the separate traits are not self-consistent, i.e.,
that the item scores on the total of which the rating on each trait
is determined are inconsistent with one another. He also finds
that the traits are not independent. (b) He argues that the classification
of traits is not valid, and that they are mere “fiat” traits,
statistical artifacts, rather than authentic factors in mental organization.
In other words, the concepts on which the instrument is
built and in terms of which performance on it is interpreted are 
not authentic, according to Lorge. (c) As to validation in general, 
he points out that the evidence by which this is established is the 
agreement of the Inventory with other personality tests. But, as he 
remarks, it was built upon them in the first place, and independent 
clinical validation is essential. 

This last point in particular has been taken up by other investigators.
Landis and Katz, and Landis, Zubin and Katz (q.v.) find
that the Inventory cannot discriminate between college students 
and hospitalized psychotics. A high score on the various traits 
seems to indicate poor adjustment, but a low score does not prove 
the opposite, chiefly, as they believe, because of dishonest re- 
sponses. That is to say, if a subject honestly makes the responses 
to the items which lead to high total scores on the traits, he will 
reveal his maladjustment. But he can conceal it readily by making 
answers which common sense indicates as “acceptable” or “credit- 
able.” 

Newcomb (q.v.) again in a very interesting field study dealt
with the problem somewhat differently. His subjects were groups 
of normal individuals living in a camp. Daily records of their 
behavior were made by the camp counselors. Thirty items indica- 
tive of introversion were set up, such as self-confidence, responsi- 
bility, etc., and the actual behavior of the subjects as concretely 
observed over a period of time was rated incident by incident on 
a four-point scale based on these items. Behavior was found to 
have little consistence, i.e., a trait did not manifest itself persist- 
ently in most of the doings of these individuals. No evidence was 
discovered for an introvert-extrovert type. 

One may remark in general that it is just as easy to invent 
personality types and personality traits as to invent faculties, and 
the result may have just as little relationship to psychological 
reality and the actualities of behavior. Nor can the matter be improved,
and an effective instrument of measurement constructed,
by erecting an elaborate statistical superstructure on the unsound
foundation of an erroneous concept.


5. Humm-Wadsworth Temperament Scale * 


This is a profile scale, consisting of 318 questionnaire items. 164
of them yield scores, and 154 are padded or “dead” items introduced
to create a “test atmosphere” and to influence responses to
the scored questions. The items are scored in a manner analogous
to the Bernreuter procedure, on 7 personality types as follows.
(1) N, Normal, characterized by self-control, self-improvement,
and inhibition. (2) H, Hysteroid, characterized by tendencies to
self-preservation, selfishness, and crime. (3) M, Manic Cycloid,
characterized by elation, excitement, and sociability. (4) D, Depressive
Cycloid, characterized by sadness, retardation, caution,
and worry. (5) A, Autistic Cycloid, characterized by shyness and
sensitiveness. (6) E, Epileptoid, characterized by ecstasy, meticulousness,
and inspiration. (7) P, Paranoid Schizoid, characterized
by fixed ideas, restiveness, and conceit. Each significant response
is rated from 1 to 5 on each of these 7 types, and the ratings are
added to make 7 scores, which are then rated from very strong to
very weak.

* References: Humm and Wadsworth; Humm; Mosier, 1938.

A commendable feature of this instrument is its use of what is 
probably about as sound a classification of personality types as can 
be had, derived from psychiatric theory and practice. This 
is in contrast to the somewhat vague, and ill-defined cate- 
gories of the Bernreuter Inventory. The items have been selected 
because of their power to differentiate persons known to be high 
in one or other of the type categories. It has been found helpful 
in personnel work, for, of the 2,000 cases chosen as being satisfac- 
torily adjusted on the basis of their test showings, very few were 
later discharged for personality reasons. Poole (q.v.), the medical
director of the Lockheed plant, where the test has been quite widely 
used, presents a favorable account, though without going into 
detail. 

The instrument is not easy to use, and requires skill and judg- 
ment for a proper interpretation and application of its results. 


6. Minnesota Multiphasic Personality Inventory * 


The Minnesota Multiphasic Personality Inventory consists of
550 items, each in the form of a simple declarative statement in the
first person singular, to which responses of true, false, and “cannot
say” are to be made. Instances are: “I seldom worry about
my health,” “My daily life is full of things that keep me interested,”
“I sometimes feel like swearing.” The basic assumption is
that the item-responses, when grouped, will form numerous
scales. Scales have been developed for hypochondriasis, depression,
psychopathic deviate, psychasthenia, hypomania, hysteria,
introvert, schizophrenia, paranoiac. Also the inventory yields a
question score, i.e., a score on the “cannot say” responses, a lie
score, and a validity, or F score. The lie score expresses the tendency
to falsify for the sake of making socially approved responses.
This is provided by responses to 15 items which are in effect catch
questions, such as “I get angry sometimes.” The F score is the
number of agreements between the key and the subject’s record.
There is also a K scale, which has been devised to accentuate the
validity of five of the nine scales developed up to the time of its
publication. The inventory is suitable for persons of 16 C.A. and
upwards. In its original form it is for individual administration,
but there is also a group form available. In the individual form
each item is printed on a separate card, which the subject picks
up and classifies according to his response. The corners of the
cards are clipped to indicate the expected response, and to facilitate
scoring.

* References: Hathaway and McKinley (all entries); McKinley and Hathaway
(all entries); Manual; Supplementary Manual.
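
The two simplest counts mentioned above, the question score and a score obtained as the number of agreements between a key and the subject's record, can be sketched as follows. This is a minimal illustration: the item numbers, the keyed responses, and the function names are hypothetical and are not taken from the published keys.

```python
def question_score(record):
    """Number of items answered "cannot say"."""
    return sum(1 for response in record.values() if response == "cannot say")


def count_agreements(record, key):
    """Number of keyed items on which the record matches the keyed response."""
    return sum(1 for item, keyed in key.items() if record.get(item) == keyed)


# Hypothetical record (item number -> response) and hypothetical key.
record = {1: "true", 2: "false", 3: "cannot say", 4: "true"}
key = {1: "true", 2: "true", 4: "true"}

print(question_score(record))         # 1
print(count_agreements(record, key))  # 2
```

Because every scale is only a different key applied to the same record, a new key can be run over old records without retesting.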

In constructing the inventory, the authors assembled a pool of
more than a thousand items from clinical experience, psychiatric
examination forms, textbooks on psychiatry, directions for case-taking
in medicine and neurology, and from earlier personality
scales. These were reduced to the present number. In developing
the various scales, the first step was to clarify and to define the
personality disorder involved. Thus, symptomatic depression is
defined as a “clinically recognizable frame of mind characterized
by a poor morale, lack of hope in the future, and dissatisfaction
with the patient’s status generally” (Hathaway and McKinley,
1942, p. 74). The construction of the key for each scale, i.e., the
item-responses indicative of the various personality disorders, was
carried through by comparing considerable numbers of cases of
each disorder with normals. Additional scales are in process of development.
The multiphasic aspect of the inventory has several
advantages. The same set of items can be used for numerous types
of personality disorder. The range of items is very large. And as
new scales are developed, it is possible to re-evaluate scores previously
obtained and existing in old records.

A number of validation studies have a [...] procedure being to check
ratings on the inventory [...] other against clinical diagnosis.
Meehl (q.v.) [...] a “blind diagnosis” of the records of 147 [...]
i.e., clinical appraisal before testing. [...]
actual abnormals who were identified, ab[...]
in their appropriate category, which is [...]
there was none. Leverenz (q.v.) makes the point that even though
the scores do not always corroborate or agree with the clinical im- 
pression, the use of the inventory often directs the clinical investi- 
gation into new and fruitful channels, and makes the clinician 
aware of problems that might otherwise be overlooked. Morris
(q.v.), on the other hand, reporting on the use of the inventory
with 320 naval personnel, finds it unsatisfactory when clinical 
evaluations are used as a criterion. It differentiated borderline 
normals from serious psychopaths, but did not aid in differential 
diagnosis among psychopathological groups. His conclusion is that 
the inventory “at its present stage of development . . . cannot be 
regarded as a practical clinical tool, the results of which can be 
accepted as valuable diagnostic aids to the psychiatric member of 
the clinical team” (p. 374). 


7. Detroit Scale for the Diagnosis of Behavior Problems *


This is a rating scale to be used by a trained examiner. It consists
of 66 items under 5 headings as follows. (1) Health and
physical factors. (2) Personal habits and recreational factors.
(3) Personality and social factors. (4) Parental and physical factors
of the home. (5) Home atmosphere and school factors. Ratings
on the 66 items under these 5 heads are made by means of
direct observation, scrutiny of medical and educational records,
and questioning of the child and his parents. They are made on a
5-point scale (1 very poor, 2 poor, 3 fair or average, 4 good, 5 very
good). The rating values are described in detail for each item.
Thus item 18 is “Home duties.” The question to the child is:
“What regular jobs do you have to do to help around home?” The
questions to the parents are: “What regular jobs does he have
around the house? Does he do them without urging?” The item is
scored as follows: score 5 for a reasonable number of duties, done
regularly and willingly; score 4 if he has to work most of the time,
fairly willingly; score 3 if he has some duties but not regular ones,
with little planning or organization; score 2 if he is forced to work
most of the time, with no time for recreation; score 1 if he has no
duties, and is encouraged to avoid any work, or if he rebels and
absolutely will not accept any duties.

In commenting on this item, the authors (Baker and Traphagen)
point out that the child untrained to work in the home is
apt to be much more immature than others, since work at home
involves an attitude of mind that carries over into school tasks.
They cite the case of a silly, noisy, disturbing Negro boy of ten,
who had no home tasks, and whose adjustment improved when
such tasks were organized. Similar elucidations are made in connection
with each of the 66 items.

* Reference: Baker and Traphagen.

The items themselves were selected as indicating basic causes of
behavior difficulties in children, revealed as such by long experience.
They touch on such matters as time of sleeping, playmates,
excitement, shock, uneasiness, vocational interest, broken home,
attitude towards home. Thus the instrument implements a controlled
personal survey of behavior problem cases, oriented by
experienced insight into their most characteristic causes. The summary
sheet shows the rating on each item, the number of items
rated, and the total score obtained by adding all credits. This latter
can be transmuted into a letter grade.
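
The summary-sheet arithmetic just described (the number of items rated and the total of all credits) can be sketched as follows. This is an illustration only: the function name and the sample ratings are hypothetical, an unrated item is represented by `None`, and the transmutation into a letter grade is omitted because the text does not give the conversion.

```python
def summarize_ratings(ratings):
    """Return the number of items rated and the total of their credits.

    `ratings` maps an item number to a rating from 1 to 5, or to None
    when the item was not rated.
    """
    rated = {item: value for item, value in ratings.items() if value is not None}
    for item, value in rated.items():
        if value not in (1, 2, 3, 4, 5):
            raise ValueError(f"item {item}: rating must be 1 to 5, got {value!r}")
    return {"items_rated": len(rated), "total_score": sum(rated.values())}


# Hypothetical ratings for four of the 66 items; item 20 was not rated.
print(summarize_ratings({18: 5, 19: 3, 20: None, 21: 2}))
# {'items_rated': 3, 'total_score': 10}
```

Reporting the number of items rated alongside the total matters because a total built on fewer items is not comparable with one built on all 66.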


8. Haggerty-Olsen-Wickman Behavior Rating Schedules * 


This is another rating instrument to be used by the examiner,
and not by the subject. It consists of 2 schedules. Schedule A is a
behavior problem record. It lists 15 such problems, e.g., defiance of
discipline, speech difficulties, etc. Each is to be checked 1 to 4 in
the appropriate one of four columns according to the frequency of
its occurrence in the child being rated. Schedule B consists of a
list of 35 traits divided into 4 groups (intellectual, physical,
social, and emotional), each of which is to be rated on a 5-point
scale. A re-rating correlation of .76 and a split-half correlation of
.92 are reported. As to validity, a correlation of .86 with frequency
of referral for discipline is reported, and only 10% of
normal children equal or exceed the scores made by subjects who
are psychoclinical cases. The user is cautioned in the manual that
the results of the schedules should be supplemented by other data
about the child under consideration.


9. Logical Decision Test †


This is an interesting and ingenious attempt to get away from
the obvious difficulty of tests calling for direct answers to questions
concerning adjustment and attitude, which is that the subject
may falsify his opinions and responses. It consists of a number
of described situations in which one would have to make a
decision, e.g., finding money or property in a public place and
deciding whether to leave it, keep it, seek to return it to the
owner, etc. The subject is asked to consider carefully and to give
reasons for his decision. He is assured that no adverse judgments
will be passed, no matter what he decides or what reasons he gives.
Answers are classified in terms of 6 kinds of goals: self-regard,
parental approval, friends’ approval, general welfare, objective or
practical considerations, social institutions. There has been found
to be considerable hedging and evasion connected with unethical
choices. The test is not of major importance, but embodies a not
unpromising diagnostic technique.

* Reference: Wickman.
† Reference: Brandt.

By way of a brief comprehensive appraisal of personality tests,
of which a very large number exist, it may be said that in the great
majority of cases they are at best experimental. Ellis (q.v.), who
has made an extensive and thorough analysis of the literature,
finds that while reliabilities as reported are “notoriously” high,
validity is usually doubtful. Of 259 studies dealing with the validation
of group personality tests, 80 reported positive results, 44 were
questionably positive, and 135 were negative. For individual personality
tests, particularly the Minnesota Multiphasic Personality
Inventory, out of 15 validation studies, 10 were positive, 3 were
doubtful, and 2 were negative, a strikingly better showing. Ellis’
tabulated findings for five personality tests are presented in
Table 29.
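
The contrast drawn from Ellis' counts is easier to see as percentages. The short sketch below does the arithmetic; it is only a worked illustration, the function name is hypothetical, and the positive count for the group tests is taken as the remainder of the 259 studies after the 44 questionable and 135 negative ones.

```python
def shares(positive, doubtful, negative):
    """Percentage of studies in each outcome class, to one decimal place."""
    total = positive + doubtful + negative
    counts = {"positive": positive, "doubtful": doubtful, "negative": negative}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Group personality tests: 259 studies in all.
group_positive = 259 - 44 - 135  # 80
print(shares(group_positive, 44, 135))
# {'positive': 30.9, 'doubtful': 17.0, 'negative': 52.1}

# Individual personality tests: 15 studies in all.
print(shares(10, 3, 2))
# {'positive': 66.7, 'doubtful': 20.0, 'negative': 13.3}
```

On these figures roughly three in ten of the group-test studies were clearly positive, against two in three for the individual tests, though the latter rest on only 15 studies.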

There has been some criticism of Ellis’ investigation, but it
seems probable that its general outcomes, at any rate, are defensible.
In any case they are confirmed by a somewhat similar
though less extensive investigation by Traxler (1946), who concludes
that “nearly all reputable personality testing outside carefully
controlled clinical situations is still frankly tentative and
experimental” (p. 424). And Kornhauser (1945 b), in his poll of
the opinions of psychologists, reports that of 79 who replied, 1.5%
found personality inventories highly satisfactory, 13.5% found
them moderately satisfactory, and the rest ranged from doubtful,
through rather unsatisfactory, to highly unsatisfactory in practical
use. As to the reliability in general of questionnaire responses,
Cuber and Gerberich (q.v.) briefly report one of the few studies
made on this topic. They gave 60 questions from the Bell Inventory
and various Thurstone attitude scales to 132 students in social
studies courses, giving them three times in all at intervals. Seventy-two
per cent of the responses were consistent. Interestingly,
responses to factual questions were less consistent than those involving
attitudes and evaluations.


TABLE 29
VALIDITY STUDIES OF PERSONALITY QUESTIONNAIRES
(Ellis, p. 424)

                                    Number     Number       Number of      Number
                                    of Times   of Positive  Questionably   of Negative
Test Employed                       Employed   Validations  Positive       Validations
                                                            Validations
Bell Adjustment Inventory ......       12          1             0             11
Bernreuter Personality Inventory       29          9             6             14
Thurstone Personality Schedule..       10          4             1              5
Woodworth Personnel Data Sheet         29         11             4             14
Other Personality Tests ........       82         40            15             27
Total ..........................      162         65            26             71


MEASURES OF INTEREST 

The measurement of interest has proved a decidedly more man- 
ageable problem than the measurement of personality. A goodly 
number of successful instruments for this purpose have been de- 
veloped. This is because the basic concept itself is much more 
clearly definable. An interest may be described as a tendency to 
make consistent choices in a certain direction without external 
pressure and in the face of alternatives. Interests as so understood 
are, within limits, observable. Moreover, an individual can make 
reasonably accurate and dependable verbal reports about his in- 
terests, which to him mean his preferences. 

Furthermore, a good deal of well-oriented and sequential investigation
has been devoted to the topic of interest through the
years. This has made possible the construction of interest tests
and scales with a real psychological content and a constructive
and intelligible psychological meaning. The main points which
have come to light may be summarized as follows:


1. Interest and success 


The relationship between interest in any line of activity, as 
expressed either in words or actions, and success in that activity 
is obviously of great importance. In order to understand what 
that relationship is, a twofold distinction must be made. 

A. The relationship between interest in any activity and objective
success in it as compared to other people is doubtful. This is
particularly true when expressions of differential interest in rather
similar lines of activity are elicited; as, for instance, the degree
of interest a person feels in school courses of academic type (v.
Bridges and Dollinger). Such expressions of differential interest
seem to have little relationship to relative success. The relationship
becomes closer if a wider range of school courses is considered,
e.g., manual training, art, music, and so forth, as well as
academic studies (v. Thorndike, 1921). If the range of differences
between preferred activities is still further extended, and there is
choice, let us say, between intellectual and social doings, then
preference begins to be of some significance in predicting success
(Wyman). This finding has been confirmed in a number of places.

B. If the pattern of a person’s expressed interests is compared 
with the pattern of his own abilities within himself rather than
with his achievement with reference to other persons, then the
relationship turns out to be high (v. Thorndike, 1917; King and 
Adelstein). This is perfectly understandable. One may be greatly 
interested in some line of activity and yet be excelled by many 
other and still more capable persons. But the existence of the 
interest may still indicate one’s own best capabilities. 

These early results seem in general consistent with the extensive
and extremely important work of Strong. He regards interest as
what he calls an “indeterminate” indicator of success. That is to
say, interest tends to be associated with success, but not directly,
since both are affected by many other factors. However, in the
case where interest continues over a lapse of time, the relationship
is closer (Strong, 1943). However, as Carter (1944) points out in
his summary of ten years of work in this field, the criteria of
success in any vocation are not adequate, and this makes the
validation of scales and tests for the measurement of interest and
the detection of interest patterns difficult in terms of success. The
relationship of interest to satisfaction seems closer and more
determinate than its relationship to success.


2. The permanence of interests 


This latter relationship between interest and ability or success 
is greatly strengthened if the interest in question is of long stand- 
ing, as, for instance, if it can be shown to exist through the 
elementary school, the junior high school and the senior high 
school periods, and into adult life (Thorndike, 1917). Moreover, 
it has been shown that interest patterns tend to grow more stable 
as life advances. Certain youthful interests, to be sure, tend to 
fade out, and others take their place. Such doings as active out- 
side amusement or the reading of fiction are apt to lose their 
appeal, and quieter occupations to be substituted (Thorndike, 
1935). Yet Strong (1937) has shown that in general the things 
most liked at the age of twenty-five are liked more some decades 
later on, and the things least liked at twenty-five are less liked 
later on. So again in 1943 he reports that the interest patterns re- 
vealed by interest scales are highly permanent, and little influenced 
by training and experience in the occupations concerned. Carter, 
too, finds that the vocational interests of high school pupils are 
incompletely developed, but highly individual, definitely patterned,
and “much more reliable and permanent than earlier
studies would indicate” (1944, p. 68). The inferences for the practical
problem of measuring interest are obvious. With very young
subjects the significance of any such ratings is dubious. But with 
older persons they may very well indicate a permanent life trend 
significantly related to the individual ability pattern. 


3. Interest groups 


The findings reported so far establish nothing more than that 
a reliable interest rating, if it can be obtained, is likely to have 
considerable significance, and that if feasible psychometric instru- 
ments can be devised in this area, they will be well worth while. 
But the decisive point for the measurement and evaluation of 
interests has been the establishment of the fact that there are 
fairly well-defined interest groups. These are groups of persons 
fairly homogeneous in interest, and differing significantly in this 
respect from other persons. 

Lewis and McGehee (q.v.), for example, have shown that there
are significant differences in interest as between bright and dull
children. The comparative interest patterns of these two groups
are shown in Table 30. The data repay careful scrutiny. It will be 
seen that the high differentiations occur with reading, sports and 
games, playing musical instruments, dramatics, and collecting, 
and that others are definitely significant, and also that the superior 
subjects had many more hobbies than the retarded. 

The establishment and analysis of differential interest groups 
has been carried to a considerable length, most notably by Strong 
(q.v.). Thus, persons in each of a large number of occupations and 
families of occupations are known to exhibit characteristic inter- 
est patterns. A young man is likely to enjoy an occupation when 
his interests harmonize with those of adult workers in it. Moreover,
the successful persons in occupations exhibit the characteristic
interest patterns with peculiar definiteness. Again, it is known
that persons in different educational curricula tend to exhibit 
differential interest patterns, though these are not so clear-cut as 
those in different occupations. It may be that different social 
groups also exhibit differential interest patterns, but so far this has 
not been determined (v. Fryer, 1932; Strong, 1931, 1943). 

Strong (1943, v. ch. 8) has brought to bear the techniques of
factor analysis upon the study of interest. He has isolated four or 
five factors, and on this basis has established eleven groups of 
men’s occupations, and ten groups of women’s occupations. 


INSTRUMENTS OF MEASUREMENT


All this work clearly opens the way for the development of
instruments of measurement. A scale can be devised and standardized
ardized in such a way that it will not merely elicit whatever 
preferential interests a person may happen to have, but will show 
their relationship to the characteristic interest pattern of this or 
that occupational or educational group. When this is properly 
done, the result is an instrument of very considerable value for 
guidance and appraisal. Once we recognize that an individual’s 
established interest pattern is related to the pattern of his own 
abilities, and furthermore that successful persons in various func- 
tional groups exhibit characteristic interest patterns, it is manifest 
that we have the basis for highly significant interpretations and 
prognostications. As Bingham (q.v.) remarks, interests properly
and frankly formulated in a situation where the subject is not
deflected by considerations of what he ought to say to make a
good impression or by ideas about prestige, are a highly meaningful
sign. Such, then, is the basis on which the measurement of
interest proceeds. We now turn to a discussion of some representative
instruments.


TABLE 30

PERCENTAGES OF SUPERIOR AND RETARDED BOYS AND GIRLS DESIGNATED
AS INTERESTED IN VARIOUS HOBBIES

(Quoted from Lewis and McGehee, Table 2, p. 598)

                                        BOYS                 GIRLS
HOBBIES                           Superior  Retarded   Superior  Retarded

Reading novels ..............        50        23         60        31
Reading history and science .        31         9         22         9
Reading funny papers ........        49        39         50        41
Active games and sports .....        67        54         42        38
Quiet games .................        26        15         29        24
Playing musical instruments .        22        10         28        11
Listening to the radio ......        39        30         37        29
Sewing, knitting ............       [?]         4         36        34
Housework ...................         7         5         32        40
Going to shows ..............        33        30         34        29
Dramatics, participation ....       [?]         4         16         6
Make-believe games ..........         9         6         24        16
Religious activities ........        17        11         21        15
Building things, shopwork ...        34        27          4         3
[illegible] .................        13         8         11         8
Driving car .................         7         9          3         3
[illegible] .................         9         4         11         6
Working, farm, store ........        10       [?]        [?]         5
Clubs, social, dance ........         4         2          9         6
Scouting or other serious
  club activity .............        13         6         10         4
Collecting ..................        30         9         22         1
None ........................       [?]        14          1        12
Number ......................      1700      5009       2505      1618


1. Interest Questionnaire for High School Students *


This questionnaire consists of items to which the subject re- 
sponds by indicating liking, indifference, or dislike. It falls into 
8 sections. There are 68 items which have to do with occupations 
which the subject likes, dislikes, or regards with indifference; 2 
having to do with activities; 20 having to do with school subjects ; 
20 having to do with job activities; 47 having to do with school 
activities; 12 having to do with prominent men; 26 having to do 
with things to own; 23 having to do with magazines. The items 
were selected in terms of their ability to differentiate the interest 
patterns of students in academic, commercial, and technical 
courses. There are 3 keys which score the subject's responses on 
the basis of similarity to these three interest patterns. It is re- 
ported that the questionnaire can predict success in the curriculum 
of the subject’s choice more accurately than it is predicted by a 
general intelligence test. The instrument is carefully and com- 
petently constructed, and is a good sample instance of others of 
the same general type. 


2. Vocational Interest Blank for Men (Strong) †


This is by all means the most important and highly developed 
of all instruments for the measurement of interest. It has been 
revised and extended from time to time and has been widely used. 
The second revision, now available, is the outcome of twenty years 
of work and experience. 

The blank consists of a lengthy and elaborate questionnaire. 
It lists 100 occupations, 38 amusements, 36 school subjects, and 
Contains 46 items having to do with types and peculiarities of 
people. To these the subject responds by indicating liking (L), 
indifference (1), or dislike (D). In addition, the blank calls for 
self-ratings on various preferences, habits, and traits which were 
selected because they differentiated between the interests of a 
large variety of occupational personnel. Its outstanding feature is 
its variable scoring, in which it resembles the Interest Question- 
naire for High School Students just described, but carries the 
principle much further. Norms for the various scoring schemes
were set up based on the declared interests of persons successful
in various occupations and of those of a large group of “men in 


* Reference: Symonds, 1930. 
† Reference: Strong, 1943.
general,” this sample being chosen to conform to the occupational
distribution of the United States Census. Scoring keys have been 
worked out based on the interest patterns of 35 occupations, 6 
groups of occupations which are psychologically similar, and for 
3 nonoccupational traits, namely maturity of interest, mascu- 
linity-femininity, and studiousness, and also for occupational level. 
Thus the item responses made by any person can be interpreted 
in terms of their relationship to a large variety of significant inter- 
est patterns. 


                             PREFERENCE RATING    INTEREST GROUP OR
ITEM                           L      I      D    TRAIT

Electrical engineer            2     -5      5    Advertiser
                               4             3    Masculine-feminine
Displaying merchandise        -2      1      2    Advertiser
  in store                    -2      1      2    Advertiser
                              -2      1    [illegible]  Masculine-feminine
Writing reports                2     -1     -1    Personnel manager
                               3      1     -1    Accountant


FIG. 26. SAMPLE RESPONSES TO VOCATIONAL INTEREST BLANK FOR
MEN INTERPRETED FOR VARIOUS OCCUPATIONS AND TRAITS


How the scoring scheme works out can perhaps best be under- 
stood from an examination of a concrete sample such as that 
presented in Figure 26. An expressed liking for the occupation of
electrical engineer gives a positive score of 2 in terms of the 
interest pattern characteristic of advertisers, and a positive score 
of 4 in terms of the interest pattern associated with the trait of 
masculinity-femininity. An expressed indifference for this occu-
pation gives a score of -5 on the norm for advertisers, and is not
related to the trait. An expressed dislike for this occupation gives 
a positive score of 5 on the norm for advertisers, and a score of 
3 on the trait. A person’s total interest score is the sum of the
positive and negative values on all items, as rated in terms of the
occupation, or occupation group, or trait concerned. A person can
have as many total interest scores as there are norms and keys
for the Blank. At present this number is 44.
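
The variable scoring just described can be sketched in a few lines of Python. This is only an illustration, not the published scoring procedure: the weights are the few cited from Figure 26, the names WEIGHTS and total_scores are hypothetical, and a response that is "not related" to a key is assumed here to carry a weight of 0.

```python
# Weights attached to each response (L = like, I = indifferent, D = dislike),
# keyed by (item, scoring key). Only values quoted in the text or legible in
# Figure 26 are used; a response "not related" to a key is assumed to weigh 0.
WEIGHTS = {
    ("Electrical engineer", "Advertiser"): {"L": 2, "I": -5, "D": 5},
    ("Electrical engineer", "Masculinity-femininity"): {"L": 4, "I": 0, "D": 3},
    ("Writing reports", "Personnel manager"): {"L": 2, "I": -1, "D": -1},
    ("Writing reports", "Accountant"): {"L": 3, "I": 1, "D": -1},
}


def total_scores(responses):
    """Sum the positive and negative weights separately for every key."""
    totals = {}
    for (item, key), weights in WEIGHTS.items():
        if item in responses:
            totals[key] = totals.get(key, 0) + weights[responses[item]]
    return totals


# One subject who likes "Electrical engineer" and dislikes "Writing reports".
answers = {"Electrical engineer": "L", "Writing reports": "D"}
print(total_scores(answers))
```

The same set of responses thus yields one total per key, which is why a single blank can give as many scores as there are norms and keys.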
A split-half reliability of .87 has been reported for 285 students
at Stanford University, and also a retest reliability of .869 after 
the interval of a week. The basic claims as to validity, which can 
be substantiated, may be summarized as follows. (a) The Blank 
discriminates sharply between those who are successful in a given 
category, and “men in general.” Thus in one investigation it was 
found that only 15% of nonengineers rated A in engineering in- 
terest (i.e., in conformity with the interest pattern of successful 
engineers). (b) Interest scores and patterns correspond well with 
vocational success. Thus of 181 life insurance salesmen who rated 
high to medium in this interest category, 67% wrote at least 
$150,000 worth of insurance a year. (c) Personnel experience 
amply validates the Blank. It sometimes misfires, and sometimes 
it is resented and disliked. But in general it is an excellent index, 
particularly when combined with other criteria.
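
For readers who wish to see how a split-half figure of this kind may be obtained, a minimal sketch follows. It assumes an odd-even split of the items and the customary Spearman-Brown step-up to full length; whether the reported .87 was so corrected is not stated here, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half(item_scores):
    """item_scores: one row of item scores per subject.

    Returns the correlation between odd-item and even-item half scores
    and the Spearman-Brown estimate for the whole test.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_half, even_half)
    return r_half, spearman_brown(r_half)
```

A half-test correlation of .50, for example, steps up to about .67 for the whole test, which shows why the two figures must not be confused when reliabilities are compared.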


3. Vocational Interest Blank for Women (Strong)* 


The same techniques of construction, and the same general 
organization are embodied here as in the Vocational Interest
Blank for Men. Norms and keys are available for 17 occupations, 
and for the trait of masculinity-femininity. Some special dif- 
ficulties were encountered in connection with this instrument. 

Women tend, to a greater extent than men, to enter and stay in
various occupations for reasons other than interest. Hence the
relationship of the interest pattern to success is not so clear. 
Again, the Blank was standardized on mature women, and this 
casts some doubt on its use for younger women, as the norms
developed may not apply well. Finally, the occupations used in 
the scaling were not homogeneous in all cases. 


4. Occupational Orientation Inquiry †


This and the example that follows are cited as samples of a
different type of instrument from the foregoing, or at least of
instruments whose primary serviceability is different. The Occu-
pational Orientation Inquiry calls upon the student to give a
review account of his vocational interests and experiences, and
then to make a self-rating on 224 occupations in terms of his
knowledge of each, his interest in it, his ability in it, and his
chance of placement in it. These ratings are made on a 5-point


* References: Berman, Darley, and Paterson; Bingham; Strong, 1943.
† References: Wellar; Peatman.
scale. As a means of presenting in summary and comprehensive
form a young person’s broad vocational orientation, it is valuable. 
Many similar questionnaires have been prepared, but the par- 
ticular advantage of this one is that it calls for self-ratings under 
the four heads of knowledge, interest, ability, and opportunity. 
This makes it of distinct service for guidance, as may be inferred 
from one report to the effect that 78% of a large group of high 
school subjects had virtually no vocational knowledge but hoped 
to be able to find employment. The authors have attempted to set 
up norms based on interest groups, but these do not appear to be 
well founded. Thus the Inquiry cannot safely be used for dif- 
ferentiation on the model of the Interest Blank for Men, but it 
can be and is a serviceable tool for the counselor. 


5. Miner’s Analysis of Work Interests * 


This is an old, but still valuable, and indeed excellent instru- 
ment. It consists of a four-page folder containing numerous ques- 
tions pertaining to vocational interests, and the subject is invited 
to reflect about them carefully before he answers. Its distinctive 
feature is its emphasis upon reflection, and the explicit intention 
that it should be used as a preliminary to a conference with the 
counselor. 

These last two instruments, though practically valuable, are 
not of any wide psychometric interest. They are included here 
chiefly to give the reader an idea of the very numerous similar 
examples that are available. 

So in general, in the measurement of interest we have one of the
most successful fields of psychometric endeavor. The reason for
this is evident. It is the possibility of formulating clear-cut and
meaningful concepts, and of translating them into items which
are at once viable and significant.


6. Kuder Preference Record †


The Kuder Preference Record consists of 14 sets of 3-choice
items. The subject is instructed to indicate which he likes least 
and most by punching holes in the appropriate positions. An 
instance of such an item is: “Visit an art gallery: Browse in a 
library: Visit a museum.” This is a modification of the earlier 
plan, under which the subject was instructed to indicate a preference


* References: Bingham; Miner.
† References: Kuder, 1946; Super.
between two situations. Kuder (1939) found that such ex-
pressions of preference had sufficient stability to be used as test
items. There is no time limit, but the time required is usually 
about 40 minutes. 

The scores on the record are classified in terms of the following 
nine areas: mechanical, computational, scientific, persuasive, 
artistic, literary, musical, social service, and clerical. Lists of 
occupations are presented under 89 combinations of the various
areas. To show the relation between occupations and areas, the
mean scores of two occupations in the nine areas are presented
in Table 31.

Mean profiles such as those shown in Table 31 are provided 
for a number of occupations. It has been remarked that the num- 
bers involved in determining them are quite small, but the occupa- 
tions differentiate according to expectation. Thus, accountants are 
high in the computational area, and actors in the artistic, literary, 
and musical areas. Preference profiles have also been found con-
sistent with college curricula. There is a significant, but not high,
relationship with school achievement. The highest correlation re-
ported is .419, for the science scale (area) with general science
achievement for women. There seems to be little relationship be- 
tween the Kuder scales or areas and the primary mental abilities 
revealed in the test by Thurstone. 

It has sometimes been said that the Kuder scales were developed
a priori, i.e., that the items indicating scientific, computational, or
persuasive preferences were designated largely by speculation and
inspection. However, the manual shows that they are based on in-
ternal statistical consistency and mutual independence. As com-
pared with the Strong Interest Blanks, the Preference Record
uses only one type of item, i.e., the vocational activity type,
whereas Strong uses many types. The Kuder Preference Record
succeeds better than the Strong Interest Blank for Women in
differentiating the vocational interests of women.

The Preference Record has been shown to be reliable enough for
counseling. There is, moreover, a fair agreement between Strong
and Kuder results (v. Triggs). However, the two instruments are
sufficiently divergent in method and purpose so that one cannot
be used to validate the other, nor is either one a substitute for the
other (Super). The Preference Record has been found difficult to
use with 9th graders, due to lack of comprehension of the language
employed (Christensen). A number of studies have appeared re-
porting normally expected relationships between occupations and
preferences, but such relationships do not always appear (v.
Bolanovich and Goodman). J. A. Lewis (q.v.), using as his sub-
jects 50 male insurance agents and 50 female social workers, in-
vestigated the relationship between the Preference Record and the
Minnesota Multiphasic Personality Inventory. He reports that
those relatively uninterested in their occupations tend to make
more abnormal MMPI scores.


TABLE 31

MEAN PREFERENCE RATINGS FOR TWO OCCUPATIONS
(From Kuder, G. F., Manual of the Preference Record, Chicago, Science Research Associates, 1946)

[Mean scores of two occupational groups in the nine areas; the figures are illegible in this copy.]


MEASURES OF ATTITUDE 


The concept of attitude has been understood in at least two 
somewhat different senses, both of which have had an influence 
upon the construction of psychometric instruments. According to 
Thurstone and Chave (q.v., pp. 6-7), an attitude is “the sum
total of a man’s inclinations and feelings, prejudice or bias, pre- 
conceived notions, ideas, fears, threats, and convictions about any 
specific topic.” On the other hand, it is often considered and 
treated as an underlying disposition towards overt action, a per- 
vasive orientation towards life (e.g., Vernon and Allport). Numer- 
ous attempts have been made to set up scales in terms of both 
concepts, and it would not be hard to find intermediate or com-
promise examples.


1. Specific Attitude Scales 


Scales intended to measure or register attitudes on some specific 
topic are constructed by a number of different procedures. 

1. First, there is the so-called method of equal-appearing in- 
tervals. This is sometimes called the “Thurstone” method, but 
the designation is not very fortunate, although Thurstone and his 
associates have been specially prominent in utilizing it (v. Thur-
stone and Chave).

An attitude scale of this type consists of a number of statements
intended to embody and reveal the presence of the attitude in
question in varying degrees. For instance, a scale for the measure-
ment of attitude towards the church may contain such statements
as the following: “I would rather go to church than do anything
else”; “I like to go to church”; “Going to church does no harm”;
“Going to church bores me”; “I would never set foot inside a
church.” Clearly these statements indicate a range from extreme
preference to extreme rejection on the specific matter of church-
going. The arrangement of such statements so that score values
can be attached to them then becomes the manifest problem. This 
is done by submitting them to a jury, the members of which pro- 
ceed to rank or rate them in order according to some simple plan 
of work. In the instance just given, the statement “I would rather 
go to church than do anything else” ranks in positive value above 
the statement “I like to go to church,” and thus receives a higher 
positive score. The scale value or score value of each item or 
statement is the central tendency of the jury ratings. Since there 
is rarely complete agreement among the jury as to the indicative 
significance of such statements, the value of each is to some extent 
ambiguous, and the ambiguity of the item is measured by the 
spread or dispersion of the ratings. Also, it often happens that 
persons will accept a given statement, and in addition accept some 
other statement far removed from it in scale value. This involves 
an element of inconsistency, and the degree of inconsistency found 
in any given item is measured by the manifested tendency for 
such inconsistent ratings to appear. On this basis the statements
are selected, and then arranged in a linear order to constitute the
scale. Any person's attitude towards the topic in question, all the
way from high positive to high negative, is determined and given
a numerical score by his checking what scale statements are
acceptable and unacceptable to him. The score value of the state-
ments, it should be explained, is calculated in terms of standard
deviations of the dispersion of jury ratings. Thus a statement which
secures a mean jury rating of 2.5 S.D. above the mean for all
statements would be given a high standard score, and so for all 
other score determinations, mutatis mutandis. 
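
The jury procedure can be illustrated by a short sketch. It rests on assumptions not fixed by the account above: the median is taken as the "central tendency," the interquartile range as the measure of ambiguity, and a respondent's score as the median scale value of the statements he accepts; the conversion to standard scores is omitted. The function names are hypothetical.

```python
from statistics import median, quantiles


def scale_statement(jury_ratings):
    """Summarize one statement from the ratings assigned by the jury.

    scale_value: median of the ratings (assumed central tendency).
    ambiguity:   interquartile range of the ratings (assumed dispersion).
    """
    q1, _, q3 = quantiles(jury_ratings, n=4)
    return {"scale_value": median(jury_ratings), "ambiguity": q3 - q1}


def respondent_score(accepted, scale_values):
    """Median scale value of the statements a respondent accepts."""
    return median(scale_values[statement] for statement in accepted)


# Hypothetical ratings by five judges on an 11-point continuum.
ratings = {
    "I like to go to church": [9, 9, 10, 10, 11],
    "Going to church does no harm": [5, 6, 6, 7, 8],
}
values = {s: scale_statement(r)["scale_value"] for s, r in ratings.items()}
print(values)
```

A statement on which the judges disagree widely would show a large ambiguity value and would ordinarily be dropped before the scale is assembled.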

2. Another method of constructing attitude scales is known as
the method of summated ratings. Just as the technique of equal-
appearing intervals has been especially associated with Thurstone,
so the latter procedure is sometimes designated as the “Likert”
[...] item in terms of five choices [...] approve, undecided,
disapprove [...] that the scale using summated ratings is no less
reliable and valid than the “Thurstone type” scale (v. L. W.
Ferguson; Farnesworth, 1945; Riker). In fact, the idea of being
able to locate a statement at some designated point (usually the
midpoint) of a continuous linear series has seemed difficult and
questionable to some investigators.
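
A minimal sketch of scoring by summated ratings follows. The five response labels and their 1-to-5 values are assumed for illustration (the surviving text names only part of the list), as is the device of reversing the values for unfavorably worded items; the names are hypothetical.

```python
# Assumed five-step response scale; the end labels and the numerical
# values are illustrative, not quoted from the source.
CHOICES = {
    "strongly approve": 5,
    "approve": 4,
    "undecided": 3,
    "disapprove": 2,
    "strongly disapprove": 1,
}


def summated_score(responses, reversed_items=()):
    """Add the values of all item responses to give one attitude score.

    responses:      mapping of item label to chosen response.
    reversed_items: items worded against the attitude, scored 6 - value.
    """
    total = 0
    for item, choice in responses.items():
        value = CHOICES[choice]
        if item in reversed_items:
            value = 6 - value
        total += value
    return total


answers = {"item 1": "approve", "item 2": "strongly disapprove"}
print(summated_score(answers, reversed_items={"item 2"}))
```

No jury is needed under this plan, since the score is simply the sum of the subject's own ratings.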

3. The work of Guttman (1947) represents a further refinement 
in attitude scale construction and evaluation. He introduces a 
criterion of scalability, on the basis of which it is found whether 
and to what extent we can reproduce a person’s response to all 
items from his response to one. For instance, if on one scale item 
60% agree, 10% are undecided, and 30% disagree, then the highest 
60% of individuals classified on total scores must be those who 
register “agree” on this item, or there is imperfect scalability. This 
leads to a technique for combining and sifting items to improve 
the scalability of the instrument, which also makes possible a re- 
duction in its length. Some other investigators have found Gutt- 
man's technique difficult if not impractical. A further important 
contribution of Guttman has been to recognize and provide means
for measuring the intensity with which an attitude is maintained
by a subject.
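
The criterion of scalability may be made concrete by a small sketch. It is a simplified illustration rather than Guttman's own computing routine: responses are reduced to agree (1) or not (0), each item is checked against the rule that those agreeing must be the highest scorers on the total, and the proportion of responses so reproduced is reported (commonly called a coefficient of reproducibility). The treatment of tied totals is an assumption, and the function name is hypothetical.

```python
def reproducibility(responses):
    """Proportion of responses reproducible from rank on total score.

    responses: one row per person, each row a list of 0/1 values
               (1 = agrees with the item).
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    errors = 0
    for j in range(n_items):
        n_agree = sum(row[j] for row in responses)
        # Rank people from highest to lowest total score. Within tied
        # totals, those who agree are placed first (an assumption), so a
        # tie alone is never counted as an error.
        order = sorted(range(n_people),
                       key=lambda p: (-totals[p], -responses[p][j]))
        for rank, person in enumerate(order):
            predicted = 1 if rank < n_agree else 0
            errors += predicted != responses[person][j]
    return 1 - errors / (n_people * n_items)


perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
flawed = [[1, 1, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]]
print(reproducibility(perfect), reproducibility(flawed))
```

Items that contribute many errors are the ones that would be combined or sifted out to improve the scalability of the instrument.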


2. Generalized Attitude Scales * 


Remmers has pointed out that while Thurstone’s technique is 
sound, it is also very laborious. In particular, it opens the way to
the construction of almost innumerable specific attitude scales. 
He has proposed to retain the rigor of the method and avoid the 
practical drawback of scaling enormous numbers of specific atti- 
tudes by constructing general attitude scales. A very considerable 
number of these, also, have been developed; and since once more 
they are all similar in principle and technique, no one is selected 
for discussion here. Generalized attitude scales constructed 
by Remmers and his associates will be found listed in the bibliog- 
raphy of tests at the end of the book under the heading Attitude 
Scales. 

A generalized attitude scale, according to the explanation offered 
by Remmers, consists of affective statements or stereotypes, all of 
which apply validly to a psychological continuum representing
attitudes towards a large body of objects, such as nations, races, 
institutions, vocations, political parties, and the like. In plainer 


* References: Remmers, 1934 b, 1934 c, 1936, 1938; Remmers and Silance.
language the idea is to develop a scale that will measure attitude
not to the Negro race specifically, but to any race; not to this or 
that school subject, but to school subjects in general, or rather to 
any school subject. The purpose is, of course, to reduce the labor 
of construction and administration and still have instruments that 
can be widely used. 

The pioneer scale of this type was for the measurement of atti- 
tude towards “any school subject.” It contained such statements 
as the following: “I look forward to this subject with horror”; “I
have seen no value in this subject”; “I don’t believe this subject
would do anyone any harm”; “This subject is all right”; “This 
subject is a good subject”; “I really enjoy studying this subject”; 
“No matter what happens, this subject comes first” (see Remmers 
and Silance). 

We now turn to consider three representative attitude tests 
which are based on the conception of attitude itself, not as a sys-
tem of evaluations with reference to some specific phenomenon or
theory, but as an underlying and pervasive disposition in the 
individual that shapes and colors his reactions to life. 


3. A Study of Values *


This test is divided into 2 parts. (1) is a set of items calling
for statements of preference as between two fields of activity. Each
item is to be answered yes or no, and the degree of preference is
to be indicated on a scale from 0 to 3 points. Such preferential
statements as that business should be operated for profit rather
than for service, or that scientific research should be for the dis-
covery of truth rather than for practical applications are set up.
(2) is a set of 4-choice items, each to be ranked in order of
preference; for instance, that the function of government is pri-
marily to relieve the needy, to develop business, to make politics
more ethical, or to achieve power. These items are scored on a
differential scheme, which yields a profile showing the importance
the subject attaches to 6 kinds of values; namely, theoretical,
economic, aesthetic, social, political, and religious.

Cantrill and Allport (q.v.), in a study of this and similar tests,
report that the scores for social values are the least reliable and
discriminating. But they claim to show that the other items, when
scored in terms of the indicated norms, “select consistent, pervasive,


* Reference: Vernon and Allport.
enduring, and above all generalized traits of personality.”
Their importance, it is claimed, lies in the fact that a person’s 
behavior is not determined by the immediate situation, or by 
transient interest, but by general evaluative attitudes to which the 
term values is attached.

As E. B. Greene (q.v.) remarks, this latter contention may be
correct, but when the attempt is made to translate it into a psy- 
chometric instrument based on a well-defined concept or concepts 
which focalize a set of test items, very serious doubts arise. Can
such an instrument truly reveal these long-term, deep-seated, 
orienting trends in the personality? In the present instance each 
of the six value categories set up seems to be a collection of 
superficially relevant items rather than a clear unitary trend. 
Theoretical values include concern with the discovery of natural 
laws, mathematical relationships, and scientific facts. Economic 
values are embodied in items having to do with activity in real
estate, finance, industry, and vocational training, which certainly
seem a heterogeneous collection. Social values are embodied in
items having to do with a sense of responsibility to others and 
their needs, with unselfishness and sympathy, some of which might 
well overlap religious values. Political values are discriminated 
by items having to do with government and political affairs, the 
exploration of the world, and the acquisition of professional and 
social prestige, the internal relevance of which is far from clear.
Religious values are supposed to be revealed by items dealing 
with the abolition of war, laying up treasure in Heaven, reverence 
in church, belief in God, and the evaluation of life as a whole. 
It is clear that the score on each of the 6 values is obtained by
responses to a confused jumble of items, held together rather on
the analogy of material in a filing system than in terms of true
psychological coherence and unity of meaning. Nor are the in-
ternal consistencies of the score-patterns for the various values
as reported by Allport and Vernon very reassuring.
This, as might be well expected, is the prevailing weakness of
instruments of this kind. Universal life values are presumably
very important, though just how consistently people are motivated
by them may be a question. But to uncover them by means of a
psychometric instrument is a difficult if not an impossible task,
because although we may know what they mean well enough for 
moral discourses, they are not defined with sufficient clarity for 
mental measurement. 
4. Test of Public Opinion *


This is an interesting variant on what has already been dis-
cussed, not so specific as the attitude scales, not so sweeping in
its claims as the values tests. It undertakes to reveal prevail-
ing attitudes on social and public matters. There are 6 subtests.
(1) Crossing out from 51 items every word that arouses more
disagreement and annoyance than agreement and attraction. (2)
Rating 53 statements on degrees of truth. (3) Paragraphs followed
by a number of short statements, to be checked as to whether or
not they are logical consequences of the paragraph. Prejudice is
supposed to be revealed by including some statements that do not
follow logically. (4) Approving or disapproving on moral grounds
various described situations. (5) Distinguishing between strong
and weak arguments. (6) Generalization test much like (2). Scor-
ing standards were worked out on the basis of jury ratings by a
group of judges. The scores purport to reveal prejudice or bias
along 12 lines, i.e., for or against radicalism, capitalism, economic
liberalism, social gospel, personal communion and mysticism, fun-
damentalism, Christian modernism, religious radicalism, Protes-
tantism, Roman Catholicism, puritanism, libertinism.

Quite a large number of ingenious tests of this order have been 
produced (cf. Murphy and Likert), but their validity is open to 
very grave doubt. They deserve attention from students of psycho- 
metrics as examples of test construction rather than as instru- 
ments for serious practical application. 


5. Pressey Interest-Attitude Tests †


This instrument can probably be treated as appropriately in 
connection with the general category here under discussion as any 
other, although to be sure it has some of the aspects of an interest 
test. It is intended for ages from grade 6 to the adult level. Accord- 
ing to its authors, it provides “a simple and expedient way of 
investigating the maturity of the interests and attitudes of a group 
with respect to a large number of items.” It consists of 4 subtests
each of 90 items. (1) Things the subject considers wrong. (2)
Things that interest him. (3) Things he worries about. (4) Char-
acteristics of persons he admires. The response required is simple,
consisting of putting an X in front of words arranged in columns.


* Reference: Goodwin B. Watson. 
† Reference: Pressey and Pressey.
If he has very strong feelings he can put two X's. The items, 360
in number, were very carefully chosen from a list of 950 on the 
basis of their power to differentiate between younger and older 
subjects. Norms for boys are reported based on 2,000 cases, and
for girls, on 2,088 cases. The split-half reliability of .94 to .96 
has been reported for the whole test for single grade groups, and 
of about .85 for the various subtests. Validity appears to be 
reasonably satisfactory, for the test correlates with estimates of 
emotional maturity made by guidance workers .66 to .72. 

To sum up, then, specific attitude scales, and generalized attitude
scales developed as efficiency devices, have proved feasible and
satisfactory. The same cannot be said of tests designed to reveal
deep-seated and controlling values. The reason is clear. The more 
definite the controlling concept, the better the instrument. 


MEASURES OF MORAL CONDUCT AND CHARACTER 


A very considerable number of tests and scales which attempt 
to measure various aspects of moral conduct and character are 
available. As good representative examples as can be found are 
the C.E.I. (Character Education Inquiry) Tests.* Instead of de- 
scribing some or all of them in detail, some of the more ingenious 
items and devices embodied will be presented here, so that the 
reader may form some notion of what is done in work of this kind. 
A full list of these tests appears in the bibliography of tests at
the end of the book. Much in the following brief account is based
upon the synopsis presented by E. B. Greene (q.v.).

One of the sub-batteries of this extensive set of tests has to do 
with Moral Knowledge and Opinion. Characteristic items em- 
ployed in the separate tests in this category are as follows. There 
is a set of true-false statements which are intended to measure 
cause and effect in the moral realm, as for instance, “God pun-
ishes bad people by making them sick.” There is a set of multiple-
choice recognition items, in which the subject is to indicate
whether certain described actions are classifiable as cheating, or
lying, or stealing, or as some other kind of offense, or not wrong
at all. There is a vocabulary test, in which the meaning of a list
of words with moral connotations is to be indicated. There is a
set of so-called “free response” items, in which a situation with


* References: Hartshorne and May; Hartshorne, May, and Maller; Harts- 
horne, May, and Shuttleworth. 
moral implications is described, and the subject is to write down
whatever he thinks might happen, being scored on the number of
important or probable consequences indicated. The same situa-
tions are also used in a test which requires the subject to say
whether certain stated consequences are probable, possible, or
would not happen at all. 

Another sub-battery is made up of tests having to do with
conduct, and specifically with the qualities of honesty, [...]
and inhibition as expressed in conduct. For the measurement [...]
traits is provided, with concrete characterizations of cooperative
acts to be applied to each child by the teacher [...]

As to inhibition and persistence, the following devices have
been invented. An exciting story is read up to the climax, at which 
point the ending is set up with capital letters entirely and words 
run together, with both small letters and capital letters and with 
words run together, and with spaces wrongly placed between the 
capital letters, the task being to separate the words with pencil 
marks to facilitate reading. A page of pied type is presented, the 
task being to count the letters. As a distraction test, lines of digits 
to be added are printed among curious pictures and lines, the 
score being the difference from adding time with normal presenta- 
tion. As an inhibition test, various temptations are applied, such 
as presenting a piece of candy not to be eaten until a given time, 
or a small safe also not to be opened until a given time. Also a 
check list of words indicating types of self-control is provided for 
ratings by teachers. 

There are also various tests of moral opinion, including lists of 
items to be checked if they indicate duties, described situations 
in connection with which the subject must say what he would do, 
and tests involving the relative importance of various moral 
principles.

One is impressed with the ingenuity with which the item con- 
tent of these tests has been fabricated. Indeed, they are full of 
devices—one might almost say of gadgets—which the psycho- 
metric worker may well find interesting and even suggestive and 
helpful. Tests of moral conduct and character seem very much 
calculated to elicit the creative urge of makers of instruments of 
mental measurement, and few such tests have exemplified this 
better than those now under discussion. The intercorrelations of
this battery are low, so it cannot be regarded as measuring some 
broad trait of moral excellence. Moreover, the relationship be- 
tween what a subject says he would or should do in a described 
situation and what he actually does in a real one is very uncertain. 
F. N. Freeman (1939) has reported negative correlations between 
the ability to say what is the right thing to do and the actual tend- 
ency to do it when the action concerned is undesirable, the 
coefficients ranging from -.13 to -.44 for individuals and groups.
However, he reports positive correlations of .23 and .53 for verbal 
decisions and actual behavior in the case of desirable actions. 

Clearly we have here yet another instance of a tempting field
for psychometric research and endeavor, in which at the same
time little success has been achieved because of the lack of
[...]ble and definable concepts about which viable and sig-
nificant items can be organized. For the general statement
may be made that tests of moral conduct and character are of
very doubtful validity.
[...] appear in this chapter? How might they be reclassified? What would be
the advantage of so doing? 

3. Examine the list of interests of bright and dull children in 
Table 30. Might a test be set up on this basis that would discriminate 
between the bright and the dull? 

4. Critically examine the concept of the personality quotient, and
the procedure for computing it, in comparison with that used for
the I.Q.

5. In a case of practical group or individual guidance, would you 
expect an interest test to contribute anything not derivable from an 
intelligence test? Would the opposite be true, i.e., a contribution from 
the intelligence test not derivable from the interest test? In each case, 
what, if anything, would such a contribution be? 

6. Do you find any tests here listed, besides the Miner Analysis of 
Work Interests, that might be very useful if employed in connection 
with conference or discussion, but perhaps not of much use if 
employed simply to obtain a score? 

7. Why might it be impossible to measure by psychometric means 
the dominance and pervasiveness of religious values, and yet possible 
to measure rather definitely a person's attitude towards God? 

8. Consider carefully the relative excellence of the Bernreuter 
Personality Inventory and the Minnesota Multiphasic Personality 
Inventory. Give reasons for your evaluation. State any psychometric 
principles that seem involved. 

9. Compare carefully the Strong Vocational Interest Blanks (for 
men and for women) and the Kuder Preference Record, from the 
standpoint of mode of construction, psychometric principles, and 
practical uses. 

10. Many of the instruments here discussed require self-ratings. 
What do you think of the chances of serious falsification in the replies? 

11. If you yourself or any one individual undertook to scale a 
number of statements revealing the strength of a certain attitude, 
using the best possible judgment and common sense in the scale 
placement of the statements, in what respects would the result be less 
trustworthy than those obtained by Thurstone's method? 


CHAPTER IX 


APPLICATIONS OF MENTAL TESTS AND THEIR 
IMPLICATIONS FOR TESTING 


INTRODUCTION 


[...] towards its further development. For this the necessary starting 
point is a review of some of the major and representative applications 
[...] interpretation of the most important [...]. Such is the purpose 
of the present [...] to cover this ground. The purpose is not [...] of 
testing for their [...] come from the major [...] psychometric methods 
and instrumentalities. 


MENTALITY AND SOCIOECONOMIC FACTORS 


There is a wealth of evidence that the 
mentality which is measured in various ways by psychometric 
instruments is not something that exists and functions in isolation 
from the general circumstances of life, and it is important to 
understand as clearly as possible what the relationship is, because 
it deeply conditions the proper use of the instruments themselves 
and the interpretation of their results. 

Ever since the analysis and publication of the results of the 
Army testing in World War I, the relationship between mentality 
and socioeconomic circumstances has been recognized. As hereto- 
fore pointed out, it was then found that occupations can be ranked 
in a rough hierarchy in respect to the median intelligence of those 
engaged in them. The evidence was brought together and sum- 
marized by Fryer (1922), and some samples of his findings are 
presented in Table 20. The relationship between intelligence and 
occupation, although regarded as significant, was not very definite. 
There was found to be much overlapping, and although success 
in a given calling might require a minimum intelligence there 
was no clear-cut upper limit. This work has been widely supple- 
mented and amplified in detail, and various modifications and 
uncertainties have appeared. It has in general been confirmed by 
Harrell and Harrell, to whose study reference has been made, for 
World War II testing. Also, there has been a great deal of debate 
about the reason for the relationship. On the one hand, it might 
be that a given low-grade occupation tended to select hereditary 
intelligence. On the other hand, it might be that such an occupa- 
tion had the effect of depressing intelligence, or at least of pre- 
venting its full development. Later on we shall have to return to 
these considerations. But whatever the cause of it, the fact of the 
relationship itself was of obvious importance for guidance and for 
the practical use of test results. Among all the evidence and find- 
ings that have emerged, however, the most important for our 
present purpose is that occupational differences in particular, and 
socioeconomic differences in general, reflect themselves not only in 
the intelligence of the adult workers in various callings, but also 
in that of their children. 


1. Child intelligence and parental occupation 

There seems to be no doubt that there is a basic relationship 
between the intelligence of children and the occupation of their 
parents. This has been formulated by Terman and Merrill (1937 b) 
in one of the best and clearest studies of the subject. Their over-all 
results are summarized in Table 32. It will be noticed that 
there is a difference of roughly 20 points in the mean intelligence 
quotients of children whose parents are in the most favored and 
least favored occupational groups. At the same time, the 
overlapping is very considerable. But in spite of this, the relationship 
manifests itself as significant. Nor does it seem to vary much with 
the chronological age of the children concerned. The data presented 
by Terman and Merrill are based on the standardization 
group of the Revised Stanford-Binet scale, which, the reader will 
[...] representative, and drawn [...] 


TABLE 32 


MEAN STANFORD-BINET I.Q.’S CLASSIFIED ACCORDING TO FATHER’S 
OCCUPATION 


(Terman and Merrill, 1937 b, p. 48) 


FATHER’S OCCUPATIONAL                         CHRONOLOGICAL AGES 
CLASSIFICATION                             2-5½    6-9   10-14   15-18 

I.   Professional                           116    115    118     116 
II.  Semiprofessional and managerial        112    107    112   11[?] 
III. Clerical, skilled trades, retail 
     business                               108    105    107     110 
IV.  Rural owners                            99     95     92      94 
V.   Semiskilled, minor clerical, 
     minor business                         104    105    103     107 
VI.  Slightly skilled                        95    100    101      96 
VII. Day laborers, urban and rural           94     96     97      98 
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being 132 of 2 years old, 126 of 3 years old, and 132 of 4 years old. 
She used the Kuhlmann Revision of the Binet scale, and obtained 
comparable results, with occupational-intelligence differences even 
more marked than those reported by Terman and Merrill, as will 
be seen from Table 33. Coffey and Wellman (q.v.), again, ran 
tests every 6 months from 1921 to 1934 on 417 young children at 
the University of Iowa Child Welfare Research Station. Classify- 
ing them in the occupational groupings adopted by Goodenough, 
they obtained similar results. The tests used were the Kuhlmann- 
Binet for those under 3 years old, and the Stanford Revision of 
the Binet scale for those over that age. 


TABLE 33 


MEAN INTELLIGENCE QUOTIENTS CLASSIFIED BY OCCUPATIONAL 
GROUPS: CHILDREN 2 TO 4 C.A. 


(Adapted from Goodenough, 1928 c, p. 287) 


FATHER’S OCCUPATIONAL CLASSIFICATION          N    MEAN I.Q. 

I.   Professional                             56     125.0 
II.  Semiprofessional                         29     119.7 
III. Clerical and skilled labor              129     113.4 
IV.  Semiskilled and minor clerical           79     108.0 
V.   Slightly skilled                         48     107.4 
VI.  Unskilled                                39      95.8 


Haggerty and Nash (q.v.), 
again, using the Haggerty Intelligence Examination Delta 1 and 
Delta 2 with a total population of over 8,000 children, found that 
the same relation manifested itself with those of their group who 
were in grades 3 to 8, as will be seen from Table 34. Occupational- 
intelligence differences, however, were by no means so marked for 
children enrolled in high school, which is a significant and 
interesting finding. 

Terman (1925), again, in connection with his monumental 
Studies of Genius, has reported that the proportion of highly gifted 
children is greatest among the professional group, next in the 
public service group, and lowest in the industrial group. 

Similar results have been reported from abroad. Duff and 
Thompson (q.v.), in their survey of intelligence in the county of 
Northumberland, England, tested, in all, 13,220 children in the 
elementary schools and [...] children in the secondary schools. 
Children of “brain workers” made the highest scores, the number 
being 1,722, and the mean I.Q., 106.6. Children of “hand workers” 
made the lowest scores, the number being 10,848, and the mean 
I.Q., 98.6. 


TABLE 34 


CHILDREN’S MEDIAN INTELLIGENCE QUOTIENTS CLASSIFIED BY 
OCCUPATIONAL GROUPS: ELEMENTARY SCHOOL AND HIGH SCHOOL 


(Adapted from Haggerty and Nash, pp. 569-70) 


                               CHILDREN,              HIGH SCHOOL 
FATHER’S OCCUPATIONAL      GRADES III TO VIII           PUPILS 
CLASSIFICATION               N    Median I.Q.       N    Median I.Q. 

I.   Professional            49       116          201       121 
II.  Business and clerical  944       107          374       112 
III. Skilled labor         1028        98           54       111 
IV.  Semiskilled labor      524      [...]        [...]     [...] 
V.   Farmer                [...]     [...]        [...]     [...] 
[...] 
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relationship is clearly and typically shown in Table 35. The two 
small communities where the tests were run differed markedly in 
cultural and general advantages. The Kansas town had a growing 
population, improving library facilities, good recreational 
opportunities, and so forth. The Ohio community was exactly the 
opposite in all chief ratable respects. The extent to which these 
differences showed up in the test performance of the children is 
strikingly revealed in the tabulation. (For primary sources of this 
material see Pintner, 1917; Paterson, 1918; Pintner, 1937.) Similar 
differences are found to appear as between the children of 


TABLE 35 


INTELLIGENCE RATINGS OF CHILDREN IN TWO COMMUNITIES, AFTER 
PINTNER (1917), PATERSON (1918) 


(Quoted from Pintner, 1931, p. 246) 


                        Kansas           Ohio 
Rating                percentages     percentages 

Very bright               4.2             0.7 
Bright                   15.6             5.8 
Normal                   66.0            65.6 
Backward                 11.4            25.3 
Dull                      2.8             2.6 
Number of cases           332             154 


different cities which are definitely separable in socioeconomic and 
cultural status (Pressey, 1919). Very great differences have been 
found also in the intelligence test performance of children in 
different schools in the same city. Thus Dickson and Norton (q.v.) 
have reported a range of average 5th grade intelligence in 29 
schools as being from 48 to 109 points in terms of test scores, with 
the median at 81, and that intelligence averages are closely 
associated with the socioeconomic status of the immediate 
neighborhood of the school. Maller (q.v.), too, studied all 5th grade 
children in 273 health centers in New York, the total number of 
subjects being 100,153. He found astounding differences in 
performance on the National Intelligence Tests and the Pintner 
Rapid Survey Test, the mean I.Q. of the lowest rating area being 
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74, and that of the highest, 118. Here again levels of intelligence 
test performance closely parallel socioeconomic status and 
circumstances. Finally, Thorndike and Woodyard (1942), studying 
the National Intelligence Test scores of 6th [...] 
cities, find the usual wide range and overlapping [...] 
high correlations of test scores with “[...] 
qualities of residents,” and “per capita income.” 

Sherman and Key, and Sherman and Henry (q.v.), [...] 
the most important and interesting investigations of this [...] 
made a study of the [...] “hollows,” i.e., isolated [...], various 
[...] Goodenough Drawing Scale. As may [...] some of their [...] 
TABLE 36 


[...]NGS OF “HOLLOW FOLK” 


(Sherman and Key, p. 283) 


                                  MOUNTAIN 
                                 (“HOLLOW”)           VILLAGE 
TEST                            N    Av. I.Q.       N    Av. I.Q. 

Stanford-Binet                  32     61.5       [...]   [...] 
National Intelligence Test    2[?]     61.2       [...]   [...] 
Pintner-Cunningham            2[?]     75.9       [...]    96.1 
Performance Tests               54     82.0       [...]    87.6 
Goodenough                      63    [...]       [...]   [...] 


[...] results, the village [...] but the test performance of the [...]. 
Moreover, the mean intelligence of the latter [...] relationship 
to the isolation and lack of privilege of [...] 
they lived. This work is of [...] 
wide range of tests used, which included performance tests. Similar 
results were obtained by Gordon (q.v.) in his frequently cited 
study of underprivileged, isolated, and often illiterate children of 
canal-boat families in England, whose mean Stanford-Binet I.Q. 
was found to be 69.6. Hirsch (1928) has in general confirmed the 
findings of Sherman and Key in his own study of Kentucky 
mountaineer children, though his work was less well controlled. And 
Pressey and Thomas (q.v.) showed that children in a “good” 
farming district were definitely superior in intelligence test 
performance to those in a “poor” one. 

At the same time we must be on our guard against uncritically 
accepting global community ratings as safe socioeconomic indices. 
Armstrong's results (q.v.) revealed no inferiority in the test 
performance of rural children when only those of native American 
stock equal in occupational classification and in educational 
opportunity to urban children were considered. She remarks that it 
is not necessarily the rural environment as such that is involved, 
but environment plus various frequently found concomitants; and 
she suggests that other things being made equal, the country is 
probably as beneficent an environment as the city in which to 
raise children. This is a salutary caution, for without prejudging 
the nature of the relationship or considering whether it is one of 
cause and effect, the bearing of the socioeconomic setting upon 
intelligence test performance is by no means simple, and indeed 
requires a good deal more analysis than it has so far received. 


3. The question of causation 


The established fact is that, although there is much overlapping, 
intelligence test scores maintain a definite relationship with socio- 
economic status. There have been numerous attempts at explana- 
tion, none of them wholly convincing. 
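
The phrase "much overlapping" can be given a rough numerical meaning. The sketch below is illustrative only: it assumes that I.Q.'s within each occupational group are approximately normal with a standard deviation of 16 points (a figure not given in the text), and takes the difference of roughly 20 points between the extreme groups from Table 32. The function name is hypothetical.

```python
from statistics import NormalDist

# Assumptions for illustration only: scores within each group are roughly
# normal, and the within-group standard deviation is 16 points. Neither
# figure is reported in the studies discussed in the text.
ASSUMED_SD = 16.0
MEAN_GAP = 20.0  # rough gap between extreme occupational groups (Table 32)


def share_above_other_mean(mean_gap, sd):
    """Share of the lower group scoring above the higher group's mean."""
    return 1.0 - NormalDist(mu=0.0, sigma=sd).cdf(mean_gap)


share = share_above_other_mean(MEAN_GAP, ASSUMED_SD)
print(f"{share:.1%} of the lower group exceed the higher group's mean")
```

On these assumptions, roughly one child in ten from the least favored group would still score above the average child of the most favored group, which is what a significant group difference combined with considerable overlapping looks like in practice.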

A. One hypothesis has been that the relationship may be due 
to selection. The less able members of an underprivileged com- 
munity may tend to remain in it while the abler tend to migrate. 
Less capable persons may gravitate towards the less desirable 
calling. Various investigators, among them Hirsch (1928) and 
Pressey and Ralston, have made this assumption. But there is 
no satisfactory proof of its truth. Duff and Thompson, in their 
study of the distribution of intelligence in Northumberland, found 
that average test performance was higher for rural children far 
from cities than for those living near them and thought this might 
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be due to the latter having greater opportunities to migrate so 
that selection would be more pronounced. But their work was 
done in the rather special setting of rural northern England, and 
the finding has not been confirmed elsewhere as a general phe- 
nomenon. All one can say is that selective migration of the abler 
members of underprivileged communities and the selection of low 
intelligence jobs by underprivileged persons is not impossible, the 
second appearing particularly plausible. But there is no substan- 
tial evidence either way. 

B. The type of test used in many of the investigations is often 
thought to favor the abler and more privileged individuals, and 
particularly the urban children. Undoubtedly this is a factor to 
be considered. Shimberg (q.v.), for instance, constructed an 
information test with two comparable forms. One of these two forms 
she scaled on an urban standardization group and the other on a 
rural standardization group, and when they were given to new 
subjects, the urban children exceeded the scores of the rural 
children on the urban form, and vice versa. [...] 
out, however, that this was an information [...] 
some extent a function of the mental processes being tested. While 
city children exceeded country children on verbal items and sub- 
tests, country children were definitely superior to urban children 
on such subtests as the Mare and Foal and on most of the form 
board tests. 

It is not true, however, that when performance tests are sub- 
stituted for verbal tests, socioeconomic differences in test per- 
formance disappear. They are usually reduced, and an instance of 
this may be seen in the tabulation of the results of Sherman and 
Key in Table 36. Even this, however, has not always been found 
to be the case. Thus L. W. Pressey (q.v.) tested 357 children 6 to 
8 years of age in grades 1 to 3, using the Pressey Primary Scale, 
which is a nonverbal test. The belief was that the use of this test 
would tend to minimize the socioeconomic differentiation of the 
rural children, yet only 22% of them reached the urban age 
medians. Thus the facts of the situation are by no means unequiv- 
ocally established. 

And even if performance test scores do show a less marked 
relationship to socioeconomic differences than do the scores of 
verbal intelligence tests, the question still remains as to which 
type of instrument is the more valid and important indicator of 
mental status. We have already seen that the two types of tests 
do not show very high correlations with one another. Thus it 
would be most improbable that both of them would show the same 
pattern of relationship with a third factor. Also, it may be said 
that there is little doubt which of these two types of test is on 
the whole better constructed, or which yields scores of greater 
general significance. The point we are discussing is often expressed 
as saying that verbal tests are “unfair” to children in under- 
privileged groups. But it may be that the differences they reveal 
correspond to the facts. 

C. A very striking and suggestive finding is that the differen- 
tiation associated with an unfavorable socioeconomic setting is 
often cumulative—that is, it becomes more marked the longer the 
persons concerned remain in it. Thus Baldwin, Fillmore, and 
Hadley (q.v.) report that infants in an underprivileged rural 
setting are not below the medians for infants in privileged urban 
settings, but that differentiation increases with age. Sherman and 
Key have shown that intelligence ratings tend to drop, and de- 
clines below test norms to increase as children in underprivileged 
circumstances grow older, this being particularly true of children 
in the remoter “hollows” studied. H. Gordon (q.v.) also finds a 
striking drop in the intelligence ratings of canal-boat children as 
they grow older. In many families the youngest children would 
group around the I.Q. levels of 90 to 100, while the oldest children 
would test almost feeble-minded. Honzik (1940), again, demonstrates 
that the relationship between the intelligence test performance 
of children and such factors as mother’s intelligence, parental 
education, and the general socioeconomic status of the home 
increases with age. At the age of 21 months her data 
show almost no relationship, but at 8 years the correlations range 
from .33 to .55. Crissy (q.v.), too, found that young children 
studied by her lost an average of about 10 points I.Q. in 18 
months in an orphanage environment. Honzik (1940) also shows 
that correlations between child intelligence and maternal intelli- 
gence show a striking rise from 21 months to 8 years, being 
negligible at the former age and definitely significant at the latter. 
The last three studies go beyond the socioeconomic factors we are 
now immediately considering, but they contribute to our aware- 
ness that the relationship between intelligence test performance 
and the type of setting may grow closer over a period of time. 
Such cumulative effects, it is true, do not always appear. Thus 
Hirsch (1928) has reported a correlation of -.23 between I.Q. 
and C.A. for the underprivileged mountain children he studied. 
This means a tendency for test performance to fall with length 
of time in the environment, but he dismisses the coefficient as too 
small to be significant, which, however, is a matter of opinion, for 
it admittedly has been found. Other studies, too, have failed to 
show a cumulative relationship, but it has been found in many 
instances, whatever the reason may be. 
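
Whether a coefficient such as -.23 is "too small to be significant" depends largely on the number of cases, which is not stated here. The following is a minimal sketch, using the standard Fisher z approximation, of how the verdict shifts with sample size. The sample sizes are hypothetical and the function name is introduced only for illustration.

```python
import math


def fisher_z_statistic(r, n):
    """Approximate standard-normal statistic for testing rho = 0,
    using the Fisher z transformation of a correlation r from n cases."""
    return math.atanh(r) * math.sqrt(n - 3)


R_REPORTED = -0.23  # coefficient cited from Hirsch (1928)
CRITICAL = 1.96     # conventional two-sided 5 per cent level

# Hypothetical sample sizes; the number of cases is not given in the text.
for n in (50, 100, 200):
    z = fisher_z_statistic(R_REPORTED, n)
    verdict = "significant" if abs(z) > CRITICAL else "not significant"
    print(f"n={n:3d}  z={z:+.2f}  {verdict}")
```

Under this approximation the same coefficient falls short of the conventional threshold with 50 cases and passes it with 100, so a dismissal of the figure cannot be judged without knowing the size of the group.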


D. In summary, two points emerge. (a) No clear-cut causal 
explanation of the repeatedly demonstrated relationship between 
socioeconomic circumstances and intelligence test scores is forth- 
coming. The problem will be reopened on a wider basis in con- 
nection with a later discussion of hereditary and environmental 
influences. But the evidence up to this point manifestly disallows 
partisan attempts to attribute everything either to the one or to 
the other. Moreover, the hereditarian-environmentalist controversy 
by no means possesses the all-obscuring importance sometimes 
attributed to it. (b) The positive and very important truth that 
unmistakably emerges and will be further confirmed is that a 
person’s mental test performance is a concomitant of the total 
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circumstances of his life. It is related to the conditions created 
by the occupation of his parents, if he is a child in the home. 
Incidentally we may note here a highly significant point. The 
relationship is with the general type and circumstances of that 
occupation, and not directly with its financial rewards. Also, intel- 
ligence test performance is related to the conditions created by 
the general community setting in which the individual lives. Fur- 
thermore, it would appear that the longer a given set of circum- 
stances, favorable or unfavorable, prevails, the closer the 
relationship is likely to become. This is very far indeed from detracting 
from the value of mental tests, for when it is found that whatever 
they may measure interacts widely with the sum-total of the 
subject’s living, a wealth of psychological content and significance 
is indicated. It is just what, on general grounds, one would think 
ought to be found if psychometric instruments are more than 
trivialities. What is clearly inadmissible—and it would be disas- 
trous to the significance of the tests themselves—is to think that 
they reveal some factor in the human mental make-up quite 
unrelated and unresponsive to anything else. 


FAMILY RELATIONSHIPS AND MENTALITY 


The question of the relation of mentality as indicated by test 
performance to the home and family has been widely studied. 
Pintner (1937), summarizing the evidence up to about 1929, finds 
a hierarchy of correlations which is shown in Table 37. It shows 
the resemblance in test performance for various degrees of blood 
relationship, and it shows very clearly that resemblances in men- 
tality steadily decrease as the relationship between the individuals 
becomes more and more remote. Of course, it must be remem- 
bered that the actual correlations obtained in the large number 
of studies here summarized vary quite widely about the rough 
averages shown in the table. Some of the most important reasons 
for such variations are the use of different tests, the application 
of them to different groups, and the lack of uniformity in the 
general conditions under which the investigations were carried 
on. But on the whole it would seem that the tabulated coefficients, 
which reveal a decidedly impressive hierarchy of descending 
resemblance, well represent the established facts. 

Since the date of this summary, however, much work has been 
done in attempting further to analyze and better to understand 
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TABLE 37 


AVERAGE CORRELATIONS IN INTELLIGENCE OF PERSONS IN VARIOUS 
DEGREES OF RELATIONSHIP 


(Quoted in part from Pintner, 1931, p. 512) 


                                 Average 
Relationship                   correlation 

Identical twins                    .90 
All twins                          .75 
Fraternal twins                    .70 
Siblings                           .50 
Cousins                           [...] 
Unrelated individuals              .00 
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1. The effect of foster home environment 
Table 38 brings [...] studies of the effect of foster home [...] 
very definite relationship between [...] 

[...] reported by Freeman, Holzinger, and Mitchell are consistently 
higher than those obtained by Burks (q.v.) and Leahy (1935). One 
explanation that has been suggested is that the foster children 
studied by Freeman had been placed to some extent selectively; 
i.e., there was a [...] superior hereditary endowment to be [...] 
and vice versa. There is some reason to believe that this may have 
affected his findings, for the foster children whom he investigated 
were adopted at a mean age of [...] years and 2 months, whereas 
those of Burks were placed [...] the age of 13 months, and those of 
Leahy before 4 months. Leahy, in particular, took special pains to 
avoid the disturbing influence of selective placement. And this in 
any case would be more apt to happen with older children whose 
characteristics and capacities had begun to become apparent. 

B. Freeman has reported an increase in the correlations of 
child mentality with foster home characteristics with length of 
residence in the foster home. In 74 cases the children, placed at 
an average age of 8 years, were tested shortly before placement, 
and the mean correlation with home characteristics was .34. When 
these children were tested again 4 years later, their mental test 
performance now correlated .52 with the characteristics of the 
homes in which they had been adopted. 

C. For those children who were placed at about the age of 
[...] years, Freeman finds a mean increase of about 5 points in I.Q. 
after 4 years of residence in the better type of foster home. There 
was, however, no increase in I.Q. with those placed in the poorer 
type of foster home. In the case of children placed in foster homes 
after the age of 12 years, there was no gain in I.Q. with residence 
in the new home. 

D. Where siblings were placed in different foster homes and 
tested after a fairly [...] 

[...] seen that she [...] mentality of [...] homes on the [...] 
between the [...] to the foster [...]. However, these low correlations 
are to some extent offset by certain other findings which she 
reports. Thus she finds virtually zero relationship between the 
socioeconomic status of the true parents and the intelligence of 
their children who had been adopted and had been living in foster 
homes for 2 years. This, of course, would seem to undermine the 
hereditarian conclusions suggested by the lack of effect she reports 
for the foster home environment. According to her, foster child 
mentality shows no relationship either to the characteristics of the 
foster home environment or to the socioeconomic characteristics 
of the child’s own parents. Burks concludes, on the basis of some- 
what dubious reasoning, that the best foster home may contribute 
as much as 20 additional points to a child’s intelligence quotient, 
and that the worst foster home may lower it as much as 20 points. 
She seems to consider an influence which can do no more than 
this as meager and unimportant. But other commentators have 
remarked that if such an influence can produce a total of 40 points 
variation in I.Q., it is very considerable indeed: enough, for 
instance, to make the difference between classification as feeble- 
minded or on the upper border line of normal intelligence. 

F. The consistently and strikingly low correlations between 
foster child I.Q. and foster home ratings which have been reported 
by Leahy (1935) may be explained at least in part by the 
interesting data presented in Table 39. In this table Neff (q.v.) has 

TABLE 39 


I.Q.’S OF CHILDREN CLASSIFIED BY OCCUPATIONAL STATUS OF 
FOSTER PARENTS 


(Leahy’s data tabulated by Neff, Table 5, p. 744) 


Occupation                   N     Mean I.Q.    S.D. 

I.   Professional            43       113        12 
II.  Semiprofessional        38       112       [...] 
III. Skilled trades          44       111       [...] 
IV.  Rural owner            [...]    [...]      [...] 
V.   Semiskilled             45       109        12 
VI.  Unskilled              [...]    [...]      [...] 


brought together some of the original data reported by Leahy but 
not tabulated by her. The striking thing about it is the remark- 
ably low and slight differences in the mean I.Q.’s of children of 
parents of different occupational groups. Whereas in Tables 32 
and 33 the differences in mean I.Q. between the children of parents 
of the highest and lowest occupational groups range from 20 to 30 
points, here the spread is only 4 points. This indicates two possi- 
bilities. (a) The range of socioeconomic differences in the groups 
studied by Leahy may well have been unusually small. In par- 
ticular, the true status of her semiskilled group may have been 
in fact higher and more closely similar to that of her professional, 
semiprofessional and skilled trades groups than the words them- 
selves would suggest, or than is ordinarily found. If this were the 
case, it would have the effect of reducing the correlations based 
upon these groups for purely statistical reasons. (b) Another pos- 
sibility that has been suggested is that foster children perhaps 
receive unusual care and attention, so that once again the real 
and effective status of the homes of the semiskilled group with 
reference to influences bearing on the children was higher than 
might appear. 
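
The "purely statistical reasons" mentioned under (a) refer to what is usually called restriction of range. The following simulation is a hedged sketch of the effect, not a reanalysis of Leahy's data: the underlying correlation of .50, the cut-off, and all names are assumptions chosen only for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=0.5, cutoff=0.5, seed=0):
    """Correlation in a full population and in one limited to better homes."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        home = rng.gauss(0, 1)  # hypothetical home-status score
        noise = rng.gauss(0, 1)
        iq = rho * home + math.sqrt(1 - rho ** 2) * noise  # standardized I.Q.
        pairs.append((home, iq))
    full = pearson_r([h for h, _ in pairs], [q for _, q in pairs])
    # Keep only homes above the cut-off, as if poorer homes were unsampled.
    kept = [(h, q) for h, q in pairs if h > cutoff]
    restricted = pearson_r([h for h, _ in kept], [q for _, q in kept])
    return full, restricted


full_r, restricted_r = simulate()
print(f"full range r = {full_r:.2f}, restricted range r = {restricted_r:.2f}")
```

In this sketch the relationship between home status and test score is identical for every simulated child, yet the coefficient computed on the narrowed group comes out noticeably smaller than the one computed on the whole population.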

The studies by Freeman, Burks, and Leahy, which have just 
been considered, used a similar methodology [...] 
child mentality. 

G. Skeels (q.v.) studied 147 foster children [...] 
average age of 2.7 months, and none of [...] 
months. The Kuhlmann-Binet was used for [...] 
years old, and the Stanford-Binet for those [...] 
these children had virtually no relationship to the socioeconomic 
status and intelligence of their true parents. 

H. Skodak (q.v.) continued this work by studying 154 children 
who were adopted very young. She found that [...] 
homes had a definitely beneficial effect in terms of [...] 
test performance. The foster homes were rated on a rather 
elaborate home inventory scale that emphasized many cultural and 
stimulating factors but disregarded economic status, which is 
probably quite misleading. On this inventory the homes were 
classified in deciles, i.e., in ten-point steps of the scale scores 
which indicated their general excellence. For each such ten-point 
rise in home status she found a reliable increase in the intelli- 
gence of the foster children there residing. On the face of it, this 
has the appearance of what almost amounts to definite proof of 
the effect of good home conditions in improving mentality. But 
extreme and hasty conclusions should be avoided, because the 
home inventory scale itself was somewhat subjective, and the 
persons who used it to rate the homes already knew the I.Q.’s of 
the children, which may have influenced their judgment. Also, 
correlations that were worked out on the Skodak data by Good- 
enough (1940) and tabulated show little relationship between 
the criterion of the foster father’s education and the I.Q. of the 
foster child. It does not seem, however, that these reservations by 
any means completely undercut the conclusions of Skodak. 

I. Speer (1940b) showed that the correlation of foster child 
intelligence to the intelligence of foster mothers is directly related 
to the length of time the child has stayed in his own home before 
adoption. That is to say, the longer the child is retained in his 
own home, the less his mental resemblance to his foster mother, 
and the earlier he is adopted the greater the resemblance. 

J. Harms (q.v.) studied a group of adopted children whose true 
parents were of low mentality and inferior socioeconomic status. 
When these children were tested after 5 years of residence in 
foster homes superior to their own, their mentality showed a strik- 
ing rise above expectancy. Thus the mean I.Q. of 87 children 
whose true mothers had a mean I.Q. of about 63 was 106. 

These results are of course very impressive, and their publica- 
tion has aroused something of a furore of discussion. But one must 
not rush to extreme explanations, or suppose that the environ- 
mentalist case has been proved beyond a reasonable doubt. For 
one thing it is necessary to recall the unreliability of much testing,
particularly when applied to adults. This means that the reported 
intelligence levels of the true and foster parents studied are open 
to a good deal of question. Indeed, the testing of these adults is 
not above criticism, even in the best of the studies. For this pur- 
pose Leahy, for example, used the Otis Self-Administering Test 
of Mental Ability, which is none too satisfactory for adults. Also, 
the various indices and methods used to rate the homes studied 
are far from perfectly reliable and adequate. 

Still, it does seem reasonably well established that adopted chil-
dren in good homes do in fact achieve mental test performances
decidedly in advance of expectation, that prolonged stay in such
homes may have a beneficial effect, and that the resemblance of
these children to their own parents in respect of mentality is much
less close than might be anticipated. One should not argue out of
hand that biological heredity has no effect on intelligence, but
only that under certain circumstances, which are not fully under-
stood, but which probably include exposure to the environmental
influence at an early age, a good home has a very appreciable
influence.


2. Twin resemblance 


All investigations have shown a high resemblance in mentality 
between twins, and a very high one between identical twins. How- 
ever, the crucial issue that has been raised in recent investigations 
is what happens to this resemblance when twins are raised apart. 

Hirsch (1930 b) dealt with this problem some years ago, and 
was able to find only a very slight effect. The resemblance be-
tween his twin pairs was not significantly reduced when they were
raised in separate environments. However, he was able to deal 
only with 4 pairs of twins separated rather late in life. 

By all means, the most important and decisive study on the 
topic is that by Newman, Freeman, and Holzinger (q.v.). They
investigated a very considerable number of twins, including 50
pairs of identical twins raised together, 50 pairs of fraternal twins
raised together, and 19 pairs of identical twins raised apart and
separated early in life. This latter is the crucial portion of the
investigation. The investigators used a large number of tests,
including the Stanford-Binet scale, the Otis Self-Administering
Tests of Mental Ability, the American Council on Education
Psychological Examination for High School Students, the Inter-
national Intelligence Test, and the Stanford Achievement Test.
The environment in which these twins were situated was rated by 
five judges on a number of carefully formulated criteria, and the 
mean of the ratings was assigned as the environmental score 
value. The first question was to what extent differences in the 
environments of the identical twins raised apart reflect themselves 
in differences in mentality. Putting the matter otherwise, would
there be a tendency for more and more marked environmental
differences to be reflected in more and more marked differences
in mentality in these very closely related children? The answer
is contained in Table 40. Height and weight were virtually not
differentiated at all by differing educational and social environ-
ments. Height was affected diversely by differing health envi-
ronments. Intelligence test performance was differentiated to a
marked degree by differing educational environments, and to a
considerable though slightly smaller degree by differing social
environments. Educational achievement was very greatly differ-
entiated by differing educational environments, and only quite
moderately by differing social environments.


TABLE 40


CORRELATIONS BETWEEN ENVIRONMENTAL DIFFERENCES AND TRAIT
DIFFERENCES FOR IDENTICAL TWINS REARED APART


(Newman, Freeman, and Holzinger, from Table 93, p. 340)


ENVIRONMENTAL DIFFERENCE

Correlations between three groups (identical twins raised to-
gether, fraternal twins raised together, and identical twins raised
apart) are shown in Table 41. For all the indices of mentality
the drop in the correlations for the third group in contrast to the
first is very striking. Identical twins raised apart, according to
the data of this study, resemble one another distinctly less than
fraternal twins raised together, and hardly more than siblings
raised together, as a comparison between Table 27 and Table 41
will show. More than this, the investigators point out that the
data do not tell the whole story. For the separate environments
in which their identical twin pairs were raised apart from one
another were not extremely dissimilar. If one member of each
pair had been raised under slum conditions, and the other in the
best and most stimulating home and cultural setting that could
be devised, the resulting differences would probably have been
much greater.


TABLE 41

CORRELATIONS ON VARIOUS TRAITS OF THREE GROUPS OF TWIN PAIRS


(Newman, Freeman, and Holzinger, Table 96, p. 347)


                      Identical,    Fraternal,    Identical,
Traits                raised        raised        raised
                      together      together      apart

Standing height       .981          .934          .969
Sitting height        .965          .901          .960
Weight                .973          .900          .886
Head length           .910          .691          .917
Head width            .908          .654          .880
Still, here again it remains necessary to exercise caution in
drawing general conclusions. Even so thorou
but it is first to ascertain what seems to be the
mental change on some particular

SCHOOLING AND MENTALITY


1. Outstanding facts that have been established 


A very large number of investigations in which mental tests 
have been administered to school groups have yielded certain out- 
standing facts which may be regarded as thoroughly established. 

A. Schooling selects intelligence. That is to say, the less intelli- 
gent tend to drop out. Mean intelligence scores tend to rise grade 
by grade, particularly in high school and college. Today this 
tendency is less marked in high school than it used to be, but it 
still manifests itself in the upper high school levels and in college.
Many colleges enroll few students whose intelligence is not above 
the population mean. 

B. There is a marked positive relationship between intelligence 
test scores and school achievement. This centers approximately 
around a correlation of .50 or perhaps somewhat lower between 
intelligence and average grade. The relationship varies consider-
ably, however, and is lower for some levels of schoolwork and for
some kinds of schoolwork than for others. It is much less deter- 
minate and also decidedly lower on the whole for achievement in 
separate school subjects. 
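
As a rough aid to interpreting a correlation of this size, the sketch below squares the coefficient to give the proportion of variance in average grade that is linearly associated with test score. The function name is an assumption introduced here for illustration, and the figure of .50 is simply the approximate value cited above.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance linearly accounted for by a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Approximate correlation between intelligence and average grade (from the text).
r_grades = 0.50
print(f"r = {r_grades:.2f} -> shared variance = {shared_variance(r_grades):.2f}")
```

On this reading a correlation of .50 leaves about three quarters of the variance in grades associated with other factors, which is consistent with the caution expressed in the paragraphs that follow.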

C. The selectivity of different institutions differs very greatly. 
The ablest student in one college may be inferior to the least able 
in another. Thus what is called a “good college risk” will depend 
upon the institution concerned.

D. Many other factors besides mentality and mental ability
determine both intention to continue with an education and actual
continuation. Economic status is undoubtedly a major influence.
This was shown many years ago by Book (q.v.) and Counts
(q.v.), and recently by Davis (q.v.) and by Karpinos and Somers
(q.v.). Thus it has been said that anyone who has the price and
is willing to spend the money can get a college degree in the
United States, irrespective of his mental capacity.

But while these findings are highly enlightening, and of major
importance in many ways, the prime question which has recently 
come to the fore is whether schooling actually affects intelligence 
as well as tending to select it. To this we now turn. 


2. The effect of preschool attendance 


The effect of preschool attendance upon mentality has become 
a very prominent issue in recent years. Clearly it suggests many
far-reaching and momentous implications. The evidence, so far
as it can be tabulated, which is not at all completely, is shown in 
Table 42. The investigators there cited report consistently that 
some gains in mean I.Q. occur during preschool attendance. Some- 
times they are small enough to be considered negligible or doubt- 
ful. But sometimes, on the surface at least and in advance of 
analysis, they appear quite marked, and above all some of them 
are cumulative, increasing with the length of stay in preschool. 


TABLE 42


GAINS IN I.Q. OVER VARIOUS PERIODS OF PRESCHOOL ATTENDANCE


                              ONE YEAR        TWO YEARS       THREE YEARS
INVESTIGATOR                  N      Gain     N      Gain     N      Gain

Anderson                      26     2.6
Bird                          54     1.8
Frandsen and Barlow           30     [?]      29     14.2
Goodenough                    84     4.6      51     6.2      2[?]   5.8
Starkweather and Roberts      103    5.5
Wellman                       652    6.6      228    10.4     67     10.5

The most important additional work on the problem is sum-
marized as follows.

A. McHugh (1940) reports an avera
I.Q. during an average preschool attendance
for 91 children. He points out that during thi
the same gain took place as that reported

therefore be disregarded.

B. Kawin and Hoefer ( 
points in mental age during 
30 weeks in preschool for 
their control group, which was carefully paired with the experi-
mental group, and which did not attend preschool, made about
the same gain. They therefore question the effect of t 
environment.

C. Olson and Hughes (q.v.) report consistent gains in I.Q.
during a nursery school attendance period of 2 years, in com-
parison with a control group of children who did not attend
nursery school. But they attribute this gain not to the influence 
of the nursery school itself, but to more general socioeconomic 
influences, for when socioeconomic factors were equated, the ad- 
vantage of the nursery school group disappeared. 

D. Pegram (q.v.) reports that from 400 to 599 days of pre-
school attendance gave 40 children an advantage of 4.9 points in
I.Q. over a group of 40 matched control children who did not
attend.

E. Skeels and Dye (q.v.) report a mean gain of 27.5 I.Q. points
for 13 children transferred at the mean age of 19 months from a 
very impoverished to a stimulating environment which was in 
effect that of a preschool. 

F. Skeels, Updegraff, Wellman, and Williams (q.v.) report that
a group of children placed in a preschool described as good but
not equal to the best gained 3.7 I.Q. points in from 200 to 309
days, and 4.6 points in an attendance period of 400 days and over,
while controls matched with them who did not attend lost an
average of 1.2 and 4.6 points in the same periods.
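
The net advantage of the preschool group over its matched controls in item F is the difference between the two changes. The short sketch below works this out from the four figures quoted there. The attendance periods are labeled as printed, and the function and variable names are assumptions made for illustration.

```python
# Mean I.Q. change for (preschool group, matched controls), as quoted in item F.
# A loss for the controls is entered as a negative change.
periods = {
    "200 to 309 days (as printed)": (3.7, -1.2),
    "400 days and over": (4.6, -4.6),
}


def net_advantage(preschool_change: float, control_change: float) -> float:
    """Difference between the two groups' mean changes, to one decimal place."""
    return round(preschool_change - control_change, 1)


for label, (preschool, control) in periods.items():
    print(f"{label}: net advantage = {net_advantage(preschool, control)} points")
```

Seen this way, much of the difference in the longer period comes from the decline of the controls rather than from the rise of the preschool group.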


G. Woolley (q.v.), in what is the pioneer study of the problem,
published in 1925, studied an experimental group made up of
pupils enrolled in the Merrill-Palmer School, and compared it both
with children on the school’s waiting list and with what she refers
to as the “Terman” group. By the “Terman” group she means the
group of children used by Terman for his report on the constancy
of the I.Q. which is tabulated on page 141 of his The Intelligence
of School Children. A portion of this table appears in this book as
Table 48. Woolley’s findings are shown in Table 43. As will be
seen, she showed a much higher percentage of children who gained
in I.Q. and much higher mean gains among those attending the
Merrill-Palmer School than among the comparable waiting-list
children, or among those included in the “Terman” group.

A pertinent question that has been raised is whether the gains 
in I.Q. that have been reported during preschool attendance are
permanent. Here the most striking study is that by Wellman
(1937). She followed through two groups from preschool to col-
lege. The first of these two groups was given the Stanford-Binet
tests at a mean age of 66 months, and took the College Entrance
Qualifying Examination at the University of Iowa at a mean age of
months. Of this first group, 21 had attended preschool.
The second group of 57 members in all took the Qualifying Exam- 
ination at the same mean age as the first, and had been given the 
Stanford-Binet at the mean age of 72 months. None of them 
had attended preschool. Of this second group, 21 members were 


TABLE 43


I.Q. CHANGES IN THREE GROUPS. CHANGE OF 5 CONSIDERED CONSTANT
(Adapted from Woolley, Table 4, p. 478)


            MERRILL-PALMER          MERRILL-PALMER
            SCHOOL PUPILS           SCHOOL WAITING LIST     TERMAN GROUP

            N    Per-    Av.        N    Per-    Av.        N    Per-    Av.
                 cent    change          cent    change          cent    change

Increase    27   63      19.7       12   33      12.7       25   25      20.6
Decrease     8   18.5    10.8       13   36      16.2       27   27      [?]
Constant     8   18.5               11   31                 47   47
Totals      43                      36                      99
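
The percentage columns of Table 43 follow directly from the group counts, and a short check helps where the printed figures are hard to read. The sketch below recomputes them. The count of "constant" cases on the waiting list is not legible in this copy and is taken here, as an assumption, to be the group total of 36 less the 12 increases and 13 decreases. All names in the code are illustrative.

```python
# Counts per group as read from Table 43.
counts = {
    "Merrill-Palmer pupils": {"increase": 27, "decrease": 8, "constant": 8},
    "Merrill-Palmer waiting list": {
        "increase": 12,
        "decrease": 13,
        "constant": 36 - 12 - 13,  # assumed: derived from the printed total of 36
    },
    "Terman group": {"increase": 25, "decrease": 27, "constant": 47},
}


def percentages(group: dict) -> dict:
    """Percentage of the group falling in each category, to one decimal place."""
    total = sum(group.values())
    return {category: round(100 * n / total, 1) for category, n in group.items()}


for name, group in counts.items():
    print(name, sum(group.values()), percentages(group))
```

The recomputed values agree with the legible printed percentages to within rounding.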
hand, Hildreth (1928) found that while children entering the first
grade after a considerable time in preschool tested on an average
6 points above children otherwise comparable who had not at-
tended, yet after 18 months in the elementary school the average
I.Q. of the first group had dropped, the average I.Q. of the second
group had risen, and the original difference of 6 points was re-
duced to 2 points in favor of the first group. T. J. Peterson (q.v.),
too, investigating a group entering the University of Iowa Ele-
mentary School, some with a preschool background and some not,
found the former 3.6 I.Q. points ahead at entry, but only 2.6
points ahead at the end of the year. Finally Voas (q.v.) reports
that a group of 111 nursery school “graduates” in the schools of
Winnetka, Illinois, on several subsequent Binet testings at several
age levels, had almost the same mean I.Q. as the total 896 pupils
tested, and were in this respect indistinguishable from them.

Wellman (1945), summarizing the literature to date, finds that 
of 22 preschool groups 11 had Stanford-Binet gains of 6 points or
more, the total number of cases being 1537. Of 14 non-preschool
groups, only 2 had similar gains. She repeats once again her con- 
tention that the results at the University of Iowa are not unique. 

So much for the data. Now for an attempt to appraise them.* 

A. First, as to the general trend revealed, three statements are 
in order. (a) As expressed in averages, the trend is unmistakable. 
In almost every one of the studies some mean gain during pre- 
school attendance is reported, however it may be explained or 
qualified. (b) The mean gains reported are sometimes small, and 
in these cases the authors often consider them negligible and 
meaningless. However, the obtained averages do not tell the whole 
story. Thus Grace Bird, who finds an average gain of only 1.8 
points (see Table 42) and is inclined to dismiss it, also reports 
individual gains ranging from 0 to 25 points. And Wellman (1932), 
whose reported mean gains are considerable, finds individual gains 
in I.Q. running as high as 40 points. It may quite possibly be that
the preschool environment has a greater effect on some children
than on others. (c) Consideration should be given to the popula-
tions used in the various studies. These will be found in Table 42
and in the summary presented above. It will be seen that the Iowa
group reported on by Wellman, for which substantial and increas-
ing mean gains were found, is much larger than all the others
combined.

* The evaluation here presented is based to a considerable extent upon the
discussions by Goodenough, 1940; McHugh; Stoddard, 1940, 1943; Stoddard and
Wellman; and Wellman, 1940. The reader is referred to these sources for a fuller
treatment.
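
The size comparison made under (c) can be checked against the one-year group sizes in Table 42. The sketch below does so. Two of the table's row labels are damaged in this copy and are read here, as an assumption, as Anderson and Wellman. The variable names are likewise illustrative.

```python
# One-year group sizes (N) as read from Table 42.
# "Anderson" and "Wellman" are assumed readings of damaged row labels.
first_year_n = {
    "Anderson": 26,
    "Bird": 54,
    "Frandsen and Barlow": 30,
    "Goodenough": 84,
    "Starkweather and Roberts": 103,
    "Wellman": 652,
}

wellman_n = first_year_n["Wellman"]
others_total = sum(n for name, n in first_year_n.items() if name != "Wellman")

print(f"Wellman: {wellman_n}; all other studies combined: {others_total}")
```

On these figures any pooled average across the studies would be dominated by the Iowa group, which is why the composition of that group matters so much to the argument.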

B. The statistical basis of the findings, particularly those from
Iowa, has been critically examined and reviewed by McNemar
(1940, 1945), and reworked with new techniques by Wellman and
Pegram (q.v.). The general upshot seems to be that intelligence
quotient changes are at any rate associated with preschool experi-
ence, whether caused by it or not. Also a number of critics
have argued that the reported gains cannot be considered well
established because of errors of measurement and imperfections
in the tests used. Two points are here involved.

(a) A typical finding is that during preschool attendance low 
I.Q.'s tend to rise, medium I.Q.'s are less affected, and very high
ones tend to fall. This situation is exemplified in the data from 
Wellman (1932) presented in Table 44. It will be noted that 


TABLE 44


PERCENTILE GAINS OF CHILDREN CLASSIFIED BY I.Q. OVER ONE AND
TWO YEARS OF PRESCHOOL ATTENDANCE
(Wellman, 1932, Table 3, p. 53)


Classification      No. of     Gain one     No. of     Gain two
                    children   year         children   years

Below average ... 19 22.1 19 10 28.0 
Average ..... +-.| 104 23.6 2I 35 206 
Superior . 65 IS.I | 20.6 26 2 { 
Very superior . 61 6.7 10.2 24 te 
Genius .... I8 —3.9 10.2 ll 7 ঠৰ 
whereas the 10 children classed below average make a mean gain
of 36 points in 2 years, the 7 who are in the “genius” category
lose 2.1 points in the same
make the obtained gains illusory and transient. It is somewhat
difficult to follow this contention, for the numbers in the “genius”
category are quite small, and their average loss of I.Q. is certainly
very slight.
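
The pattern described under (a) is also what errors of measurement alone would tend to produce when children are grouped by their first score, and this appears to be the statistical point behind the criticism. A minimal simulation sketch follows. Every figure in it is an assumption chosen for illustration (true scores with mean 100 and standard deviation 15, an error standard deviation of 7, and no real change between testings). None of it is drawn from Wellman's data.

```python
import random
from statistics import mean


def simulate_retest(n=20000, true_sd=15.0, error_sd=7.0, seed=1):
    """Return (first, second) score pairs with NO true change between testings."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(100.0, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)   # independent error, test 1
        second = true_score + rng.gauss(0.0, error_sd)  # independent error, test 2
        pairs.append((first, second))
    return pairs


def mean_change_by_band(pairs):
    """Mean retest change for children grouped by their FIRST score."""
    bands = {"below 90": [], "90 to 110": [], "above 110": []}
    for first, second in pairs:
        if first < 90:
            key = "below 90"
        elif first <= 110:
            key = "90 to 110"
        else:
            key = "above 110"
        bands[key].append(second - first)
    return {band: mean(changes) for band, changes in bands.items()}


changes = mean_change_by_band(simulate_retest())
for band, change in changes.items():
    print(f"first score {band}: mean change {change:+.2f}")
```

Under these assumptions the low band shows an apparent gain and the high band an apparent loss although nothing has changed. Such an effect cannot by itself produce a gain in the mean of the whole group, which bears on the weight to be given to the consistent overall gains reported.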

(b) An altogether more important objection is that mental 
measurements at early ages are of doubtful reliability and validity 
and do not predict later test performance or mental status well. 
This has already been noticed in our previous discussion of mental 
tests for young children. It has grave implications for the present 
problem, since if earlier and later testings are unrelated, one can- 
not say whether the alleged gains mean anything or even if they 
are real. 

Wellman (1940) has made a detailed reply as follows. (a) Test- 
ing of young children reveals a wide range of I.Q. changes asso- 
ciated with preschool attendance even among selected cultural 
groups. (b) Such gains are found consistently for all ages as 
between fall and spring testings, whereas unreliability would pro- 
duce both losses and gains. (c) There are only small sex differ- 
ences in I.Q. gains, another piece of presumptive evidence for 
reliability. (d) Test-retest correlations at preschool ages are within 
the range of test-retest correlations for a wide range of school 
ages. Nemzek (1933 b), reporting on the whole research literature 
to that date, finds a mean test-retest correlation for the Stanford- 
Binet scale of .83 for all ages. Wellman's own test-retest correla- 
tions at preschool ages are from .88 to .90. 
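
Points (b) and (d) of this reply can be illustrated together. The sketch below, built on assumed figures only (true-score standard deviation 15, error standard deviation 6, and a hypothetical uniform true gain of 5 points), compares a fall and spring testing with and without a real gain. It reports the mean gain, the share of children who gain, and the test-retest correlation.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fall_spring(true_gain, n=5000, error_sd=6.0, seed=7):
    """Simulated fall and spring scores; all parameters are illustrative."""
    rng = random.Random(seed)
    fall, spring = [], []
    for _ in range(n):
        true_score = rng.gauss(100.0, 15.0)
        fall.append(true_score + rng.gauss(0.0, error_sd))
        spring.append(true_score + true_gain + rng.gauss(0.0, error_sd))
    return fall, spring


def summarize(fall, spring):
    gains = [s - f for f, s in zip(fall, spring)]
    return {
        "mean_gain": sum(gains) / len(gains),
        "share_gaining": sum(g > 0 for g in gains) / len(gains),
        "retest_r": pearson_r(fall, spring),
    }


no_change = summarize(*fall_spring(0.0))   # unreliability only
with_gain = summarize(*fall_spring(5.0))   # unreliability plus a real gain
print("no true gain:", no_change)
print("true gain 5: ", with_gain)
```

With error alone, gains and losses are about equally common and the mean gain is near zero. A uniform real gain shifts the mean and the share gaining while leaving the test-retest correlation essentially unchanged, so a high retest correlation and a consistent mean gain are compatible findings.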

There is another point which Wellman does not mention, though 
it is of considerable importance. The low predictive efficiency of 
early testing has been exaggerated, and represented as a good 
deal more universal than it is. As will be recalled, our survey of 
the data on early testing indicated that while prediction is very 
low at very early ages, by about the third year it begins to become 
appreciable. 

Taken together with the facts, these arguments have much 
force. Test scores in early childhood may be doubtful indicators 
of later status, although the claim needs considerable qualification. 
On the other hand, such consistent and widespread gains as have 
been discovered are impressive and must have some explanation. 
It is on this datum that Wellman ultimately and very properly 
rests her case, and it cannot be disrupted by any a priori argu-
ment, or indeed by anything short of a direct demonstration that
the alleged gains themselves do not take place. All any less de-
cisive argument can properly do is to leave a certain reasonable
doubt as to basic explanations.

C. Another possibility that has been suggested is that the gains
in I.Q. may arise from a practice effect produced by the first test-
ing, which makes later test performance better, or perhaps from
a general adjustment to school conditions which makes the child 


ment account for all the reported results, although very likely 
they do have some influence. 


D. Several of the studies, for instance that of Goodenough
greatest gain, whereas the superior or very superior child may
make little or none, or even lose. For it is precisely the superior
children who are likely to enjoy the better and more stimulating
home conditions. Surely, it is no argument against a good hospital 
to say that some people have such good homes they do not need 
its services. Many other people emphatically do need them! 

E. In summary, the evidence cannot be considered conclusive, 
and broad statements of cause and effect should certainly be 
avoided. Such statements are not only unsupported by what has 
been discovered, but also, and more importantly, they deflect 
attention from the real issue. 

What appears is that certain types of school environment are 
associated with superior mental test performance, and more spe- 
cifically with improved mental test performance. Just why such 
gains take place and whether they are as lasting as Wellman 
contends are questions to be treated with much reservation. The 
question obviously for anyone who wishes to deal constructively 
with human nature is what the characteristics of such favorable
environmental settings are. It is not impossible to arrive at an 
answer that is at least suggestive. 

Thus Skeels, Updegraff, Wellman, and Williams (q.v.) found
that children benefited greatly by attendance at what they de-
scribe as a reasonably good preschool that was organized in the
orphanage home where they had been placed. What, then, were
the characteristics of this favorable environment? Preschool took
most of the day. It began at 8 A.M. with vigorous play. Then there
were quieter activities. Tomato juice and cod-liver oil were served
about 9:30. Then there was a rest period, followed by constructive
activities, musical experiences, and a story or excursion, after
which the children washed themselves for the noon meal. After
that there were naps. School continued until about 3 P.M., with
books, music, and constructive activities generally, and excur-
sions, etc., continuing until about five o’clock. After that, supper
and bed.

Skeels and Dye, again, reported the sensational and perhaps 
questionable gain of an average of 27.5 I.Q. points for 13 children 
in a “barren” orphanage, the gain being associated with a special 
environment organized within the institution. The essential fea-
ture of this environment appears to have been special care for
these children, organized by using older girls who were inmates
of the institution and who were mostly feeble-minded, but who

3. The effect of later schooling


There is substantial evidence that continuation in school is 
associated with improvement in mentality as revealed by test 
performance. Thus McConnell (q.v.) made a study of 70 college
seniors who had been tested as freshmen on the 1927 edition of
the American Council on Education Psychological Examination
for College Freshmen. As seniors they were tested on the 1928
edition, and the 1927 scores were transmuted into 1928
values. A mean gain of 40 points was found in the composite
scores. Again, Thomson (q.v.) studied 106 students who had taken
the 1935 edition of the American Council Examination as high
school seniors in January 1937. In September of the same year
they took the 1937 form of the test. The scores on the 1937 edition
were transmuted into the values of the 1935 edition. A mean gain
of 14.5 points was found. It should be noted that the elapsed time
between testings was much less in Thomson’s study than in that
of McConnell. So, too, Livesay (q.v.) found a mean gain of 44.8
points on the American Council Test for 50 students in college
over 4 years. And Rogers (q.v.) has reported similar results at
Bryn Mawr. Also Barnes (q.v.) gave the American Council Test
to 105 freshmen on admission, and to the same group at the end
of sophomore year. He found a net mean gain of 25 points.

The most important study on the issue is that of Lorge (1945). 
Its chief findings are summarized in Table 45. Lorge retested in 
1941 a group of 131 persons, all of whom had been members of a
group of 863 tested in 1921, twenty years earlier. In the 1941
testing he used the Otis Self-Administering Test of Mental Abil-
ity, Higher Examination, Form B, and Part III of the Thorndike
Intelligence Examination for High School Graduates, Form V. In
the 1921 testing the Thorndike-McCall Reading Scale and the
I.E.R. Arithmetic Test had been used, and composite scores
worked out. The scores so derived from these two tests measure
about what is measured by a standard intelligence test. As will
be seen from Table 45, there was a substantial relationship be-
tween gains on the second testing after a lapse of twenty years,
find that average Otis scores in 1941 for those who had completed
8, 9, and 10 school grades are respectively 39, 38, and 37, but that 
the average Otis scores made in 1941 for those who had completed 
15, 16, and 17 school grades are 53.5 and 54.5. This amounts to 
an increment in their favor of 2 years of mental age on the Stan- 
ford-Binet scale. An examination of the tabulated results will 
show other evidence pointing towards the association of high
1941 intelligence scores with continuation in school. Similar results 
were obtained for scores on the Thorndike test also. 

Garrett (1946) raises various critical points on the Lorge study. 
He contends that the method of tabulating exaggerates the gains, 
which, although real, are “modest,” and no cause for “smugness.” 
Also he objects to the translation of group test scores into age 
scores and I.Q.’s for adult subjects. 


4. General evaluation of results 


Anything that contributes to an understanding of the relation- 
ship between test performance and schooling is of great impor- 
tance for psychometrics, because of the close connection between 
the construction and validation of tests and the institutional 
environment of the school, which has already been noted. This, by 
all means, is the point on which one should concentrate for a 
proper appreciation of the work that has been discussed. Sweep- 
ing causal “explanations” should be treated with the greatest 
reserve. The evidence does not decisively support them, and they 
deflect attention from the really crucial issues. 

A. It seems clear that continuation in school and exposure to 
certain types of school environment, particularly in early life, are 
associated with superior mental test performance, and above all 
with the improvement of mental test performance. This is very 
far from a disparagement of mental tests. On the contrary, it 
indicates that their psychological content and implications are 
rich and meaningful. If they do not measure abstract and absolute 
capacities segregated from all external conditions, so much the 
better. If test performance is associated, as an index in fairly
reliable quantitative terms, with favorable and stimulating con-
ditions of total living, this adds to rather than detracts from its
meaningfulness.

B. The work discussed contains many intimations and implica- 
tions of the highest importance for educational organization. What 
kind of school environment should be set up to foster mentality?
Why does the conventional school environment fail to do so?
These are among the pertinent, broad, and vital questions raised. 
And it would seem that the proper use and interpretation of 


We persist in trying to focus everything on the issue of environ-
the white draft on Army Alpha was exceeded by 20% of pure
Negroes, by 25% of Negroes with one-quarter white blood, by 
30% of mulattoes, and by 35% of quadroons. Also, in testing on 
a considerable scale in the state of Virginia, it was reported that 
the mean test performance of pure Negroes was 69.2% that of 
whites, the test performance of Negroes with one-quarter white 
blood was 73.2%, of mulattoes 81.2%, and of quadroons 97.8%.
This has often been considered clear-cut evidence for hereditary 
racial differences. 

One great systematic difficulty of all such work, however, is 
always to determine the degree of racial purity or admixture. 
Ferguson’s criterion was skin color. He determined the racial com- 
position of his subjects by matching their skin pigmentation 
against color combinations containing known proportions of black, 
white, yellow, and red. It was assumed that the greater the pro-
portion of black, the purer the Negro strain. This has proved
entirely fallacious, for many Negroes with very dark skins are 
very far from being pure racial samples. The only way to ascertain 
racial admixture accurately, or even with a reasonable reliability, 
would be by a study of family heredity. Quite apart from the
labor involved, the data for this, consisting of records of marriages 
and births at the very least, simply do not exist, especially in the 
case of Negroes and Indians. So Ferguson’s proposed hierarchy
collapses. 

Nevertheless, the obtained differences in mental test perform- 
ance are undoubtedly real, and demand explanation. They pose 
problems of high significance, both for the practical issues of race
relationships and for the proper understanding of mental tests and
psychometric techniques.


2. Nonracial factors 


From the very first it was recognized, at least dimly, that when 
mental tests are given to different racial groups, many nonracial 
factors will affect the results. These have been more and more 
stressed. Klineberg (1928, 1935 b) presents an excellent summary 
account of these factors, which has been elaborated by numerous 
other writers. 

A. The first and most obvious of these nonracial factors is that 
of language. Attention has often been called to it. Pintner and 
Keller (q.v.) used the Stanford-Binet scale, the Pintner Non-
Language Group Test, and various performance tests, in testing
a population of varied racial groups in Youngstown, Ohio.
They found that children who hear a foreign language at
home test lower on the Stanford-Binet and probably on any lin- 
guistic test than those who use English at home. This difficulty 
Ut it certainly is in 
the Pintner Non-
parisons. It is known that they are associated with mental test
performance. But socioeconomic categories mean different things 
for different racial groups. Thus among Southern Negroes, only 
the most exceptional can become lawyers or physicians or large 
business enterprisers, and then they are by no means equated to 
whites in the same classifications. So a semiskilled Negro worker 
may represent a decidedly different socioeconomic level from that 
of a semiskilled white worker. Such factors may easily have more 
influence on test performance than racial inheritance itself. Beck-
ham (q.v.) studied a population of 1,100 Negro boys and girls.
Mean I.Q.’s for the upper socioeconomic levels ranged from 97 to 
101, and for the laboring groups in the low 90's. They were lower 
than those of whites in the upper brackets, but below the top 
classifications in socioeconomic status there was not much dif- 
ference. The same difficulties undoubtedly apply to most racial 
groups. 

D. The impact of schooling, which as we have seen is asso- 
ciated with test performance, is not identical for all racial groups. 
Negroes and Indians have fewer educational opportunities of any 
kind than whites. The selective effect of schooling upon them is 
both more stringent and different. The content and type of the 
educational environment available to them is inferior. Thus groups 
of various races classified on educational status are not fully 
commensurable. Also, the suggestion has been made that reactions 
to schooling are different, and that less privileged racial groups 
may take it more seriously and work harder (Ferguson, 1916). 

E. The problem of rapport, always important in mental 
measurement, becomes crucial in much testing of different racial 
groups. The sophisticated Negro is apt to be suspicious, for he is
aware that mental measurement has seemed to relegate him to in- 
feriority. The rural Negro is apt to be shy and fearful. Such 
influences and effects are often the cause of confusion, and of low 
test performance which is quite invalid because of the intrusion of 
destructive variable errors. There is, also, a wider problem of 
motivation, i.e., of willingness to be tested. 

F. The claim has been made from time to time that a speed 
factor enters into racial comparisons based on mental tests. 
Negroes, Indians, and members of other races are said to be 
slower in their reactions than whites. It is not clear that this 
factor is generally present, although in many special cases and 
situations it should undoubtedly be taken into consideration. Quite 
probably speed of response is not a true hereditary racial factor.
Klineberg (1928) compared groups of Negroes and Indians in this 


3. Conclusions 


From the various considerations that have been discussed, certain
broad conclusions have emerged that can be regarded as


A. It is possible that true hereditary racial differences in mentality
and ability exist, but how and to what extent they reflect
themselves in test performance is not known. Klineber 
this one, reaches the conclusion that differences in test performance
as between different racial groups can in the main be explained
by environment and opportunity. This, of course, is the
essential point; and without entering into a discussion of causes, 
it may be said that such differential scores certainly reflect non- 
racial factors and the circumstances of life and do not separate out 
or isolate hereditary racial factors with any clearness.


TABLE 46


STANFORD-BINET I.Q.’S OF NEGROES CLASSIFIED BY LENGTH OF
RESIDENCE IN NEW YORK CITY


(Klineberg, 1935 a, p. 46)


Group Classification       N      Average I.Q.
Less than 1 year           42     81
1-2 years                  40     84
2-3 years                  40     85
3-4 years                  46     89
More than 4 years          47     87
New York born              99     87
All southern born          215    85


B. Hollingworth and Witty (q.v.), in their interpretive summary,
propose that instead of attempting to study the psychology
of race, we deal rather with the psychology of specific census 
groups. These are actual functioning groups, such as the Cherokees 
of the Five Civilized Tribes of Oklahoma, the children of Welsh 
descent in Wilkes-Barre, Pennsylvania, and so on. Such census 
groups are defined in terms partly of ecology, partly of anthropology,
partly of political science. This is in line with the resolution
passed in 1939 by the American Anthropological Association
at its New York meeting (q.v.), which was as follows: “(1) Race
involves the inheritance of similar physical variations by large
groups of mankind, but its psychological and cultural connotations,
if they exist, have not been ascertained by science. (b) Anthropology
provides no scientific basis for discrimination against
any people on the ground of racial inferiority, religious affiliation, 
or linguistic heritage.” This implies that psychometric instru- 
ments, if the scores they yield are to have any meaning at all, 
must be brought to bear upon and interpreted in terms of actual
functioning groups of human beings.
When so used they can be of great and authentic service in 
many ways in helping to deal with the complex and troubled 
issues of race. By way of a single example, M. D. Jenkins (q.v.)
undertook a psychometric study of 8,400 Negro children in the 
Chicago schools. The ablest of this population were selected by 
nomination by their teachers, and of these again the ablest, num- 
bering in all 103, were measured by the Stanford-Binet scale. The 
results are shown in Table 47. It will be seen that all the intelli 


TABLE 47 


DISTRIBUTION OF I.Q.’S OF 103 SUPERIOR NEGRO CHILDREN
tremely high, and that the
and obviously important i fted Negro is no
anomaly, and that there is much unrecognized and wasted Negro
ability. Along the same line Witty (1934), in a testing survey of 
27,643 Negro pupils, found 33% of all boys and 35% of all girls 
with I.Q.’s in excess of 140. 

Thus the proper and instructed use of tests is not to demonstrate
racial inferiority or superiority, but, as Stoddard well points
out, to discover and foster ability and to uncover the facts necessary
both for research and for practical decisions. And as he also
remarks, intelligence tests should be supplemented by the use
of the best obtainable measures of “stability and special apti- 
tude.” 


GENERAL CONCLUSION 


It is clear that the four avenues of applied mental measurement 
discussed in this chapter have brought to light findings not only 
significant in themselves but of the first importance to psychomet- 
rics itself. The basic logic embodied in psychometric instruments 
is to set up a working conception of some ability or process, to 
select test items with respect to it, and to interpret responses to 
these items in terms of the performance of a standardization 
group taken as representative of the distribution of the ability or 
process in a much larger population or universe of discourse. The 
obvious difficulty is the representative character of the standardi- 
zation sample. If it is drawn from one census group, or one racial 
group, or one socioeconomic level, or from persons with a certain 
stated amount of schooling, the norms it yields may lead to falsi- 
fication when they are applied elsewhere. Even when, as with the 
Revised Stanford-Binet, the standardization sample is selected in 
proportion to occupational distribution in the country as a whole,
it may easily be biased in other respects. In any case, such a group 
is always an average sampling of many subgroups, and thus there 
is a danger of falsification when the norms are applied to any one 
of them. Such distortions and misunderstandings of norms need 
not occur. But if they are not to do so, we must constantly bear 
in mind the real basis of comparison and evaluation on which the 
test is built. 
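
The dependence of a score's meaning upon the standardization group can be shown in a small sketch. The raw scores below are hypothetical and serve only for illustration; the function name `standard_score` is likewise an assumption and is not taken from any test manual. The same raw performance is expressed first against one norm group and then against another.

```python
from statistics import mean, pstdev


def standard_score(raw, norm_sample):
    """Express a raw score as a deviation from the norm group's mean,
    in units of that group's standard deviation."""
    return (raw - mean(norm_sample)) / pstdev(norm_sample)


# Hypothetical raw scores of two illustrative standardization groups.
group_a = [40, 45, 50, 55, 60]
group_b = [30, 35, 40, 45, 50]

raw = 50  # one subject's raw score

z_a = standard_score(raw, group_a)  # exactly average relative to group A
z_b = standard_score(raw, group_b)  # well above average relative to group B

print(f"relative to group A: {z_a:+.2f} SD")
print(f"relative to group B: {z_b:+.2f} SD")
```

The subject's performance is unchanged; only the basis of comparison differs. This is the sense in which norms drawn from one group may mislead when applied to another.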

The additional major point which emerges from this chapter is 
that the results of tests so constructed invariably reflect the total 
circumstances of life that surround and affect the persons tested, 
and cannot be isolated from them. This enhances rather than 
detracts from the psychological content, meaning, and usefulness 
of tests. But it must never be forgotten. If test scores are taken 
as revealing absolute, isolated mental components, the same for 
all human beings irrespective of circumstance, the most deplorable 
misinterpretations are certain. If we remember that a mental test 
is an instrument for evaluating an unknown individual, the sub- 
ject, with reference to the performance of a standardization group 
taken as a sample, and to make the evaluation conveniently as a 
numerical score, and if also we remember that the subject's performance
will reflect in some measure all the actual conditions
of his life, then psychometric instruments can be used for highly 
constructive purposes. They can reveal much that would other- 
wise go unrecognized and unknown, and the numerical scores they 
afford can be guide lines in the endeavor to understand and to 
better human conditions. 

Nor can it be said with justification that such tests are merely 
ad hoc practical instruments, without theoretical validity or significance.
The performance of a standardization sample on well-chosen
items focusing in a well-selected concept really does represent
the functioning of the mental process or ability so conceived
in a particular setting. If it were possible to get at the
basic components of human mentality in the abstract and irrespective
of circumstance, perhaps far better tests could be made.
But the science of psychology has not progressed so far, and per- 
haps it never will. Perhaps its true task will always be the study 
of human mentality as it actually manifests itself in innumerable 
settings; and if this is so, the universal test unaffected by socioeconomic,
or family, or educational factors, and dealing with
something which is constant for all groups, is not only an unattainable
but also a false ideal.


QUESTIONS FOR DISCUSSION


1. Consider what inferences can be drawn from the fact that verbal
intelligence tests “favor” socioeconomically privileged groups and 
urban groups. 

2. What characteristics of a family and home environment would
you expect to be associated chiefly with superior test performance? 
What characteristics would probably not be associated at all with it? 
What characteristics might be associated with poor test performance? 

3. To what extent do you consider the limited conclusions drawn 
by Carter in the item above in the additional readings regarding the 
effect of different environments on twin resemblance compatible with 
the findings of Newman, Freeman, and Holzinger? 

4. It is usually believed that an ordinary school environment has 
little effect in producing better mental test performance. If this is 
true, what reasons can you find to explain it? 

5. Consider the bearing of the material in this chapter upon the 
problem of test validation, i.e., the determination of what a test 
really measures. 

6. If test performance is responsive to the various types of influ- 
ence discussed in this chapter, might this be used as an argument to 
show that mental tests are worthless? How might such an argument 
be framed? 

7. Examine in detail the various criticisms made by Goodenough 
in the item listed above among the additional readings. How far do 
they seem to you correct in view of the data and contentions she 
refers to? What answers, if any, can you find to them? 

8. Examine the defense put forward by Stoddard in the item listed 
above among the additional readings. To what extent does he seem
to summarize and interpret correctly the studies to which he refers? 
Can you find any replies to his contentions? 

9. Consider the idea that mental tests reveal absolute, abstract, 
isolated mental functions and abilities as such. What reasons are 
there for and against such a view? What would be some of the theo- 
retical and practical consequences of maintaining it? 

10. If certain racial groups and certain socioeconomic groups make
better responses relatively to performance tests than to verbal tests,
does this mean that performance tests are superior to verbal tests?
If not, what does it mean?


CHAPTER X 


WIDER PSYCHOLOGICAL ISSUES IN MENTAL 
TESTING 


INTRODUCTION 


Having sought to discover what light is thrown upon the psy- 
chological content and meaning of mental tests by the work that 
has been done in their major areas of application, we now turn 
to certain wider and more general psychological issues involved 
in them. These are the problem of the constancy of mental traits, 
the problem of the nature and course of mental growth, the prob- 
lem of the distribution of mental traits, and the problem of heredi- 
tary and environmental influences in determining mentality. All 
these four problems, together with the topics discussed in the last
chapter, come to a focus in one culminating and inclusive problem;
namely, that of the psychological significance of test performance,
which is sometimes formulated, though in an unduly
limited way, as the significance of deviates.


THE CONSTANCY OF MENTAL TRAITS


1. The problem 


Granted that a person displays a certain pattern of mental 
abilities, is he likely to retain that same pattern over different 
and perhaps extended periods of time and under different circum- 
stances? Is a person’s mentality likely to stay the same, or is it 
likely to change? This is the problem of the constancy of mental 
traits or abilities. 

It is one of the most generally misunderstood of all topics 
connected with mental measurement. First, it is continually con- 
fused by being identified with the question of the constancy of 
the intelligence quotient. But this is putting the cart before the 
horse. The intelligence quotient is nothing more than one of the 
indices or units often used for expressing one of the aspects of 
mentality in numerical terms. Granted that it is a properly chosen
and derived unit, then if a person’s mentality remains the same
over a period of time, his I.Q. also should stay the same. But if 
his mentality changes, then so should his I.Q. So the issue is with 
the individual’s pattern of mental abilities and its tendency to 
remain the same or to change; and the enormous mass of data 
which has been accumulated on the constancy of the I.Q. is significant
chiefly because it constitutes evidence bearing upon that
issue.
and carefully evaluated test performance. If, on the other hand,
it is subject to substantial alteration, then frequent re-surveys are 
in order. If certain circumstances do not affect it and other circum- 
stances do, then it is important to know what these circumstances 
are. But once again what we have to deal with are not general 
theories, but matters of fact on which wise practical decisions can 
be based. 


2. The constancy problem and the constancy of the I.Q.


The bulk of available evidence, though not all of it, regarding 
the constancy of mental abilities is contained in the studies on 
the constancy of the intelligence quotient. The outcomes of these 
studies will be summarized, and their significance interpreted as 
compactly as possible. 

A. The broad evidence is that under ordinary circumstances 
the intelligence quotient is constant within certain clearly defined 
limits. In Table 48 are shown the data which Terman assembled 
many years ago on this point. He summarizes them as follows. 
“(1) The central tendency of change is represented by an increase 
of 1.7 in I.Q. (2) The middle fifty percent of change lies between 
the limits of 3.3 decrease and 5.7 increase. (3) The probable error 
of a prediction based on the first test is 4.5 points in terms of 
I.Q.” (Terman and Others, p. 142). Rugg and Colloton (q.v.), summarizing
the results obtained up to 1927 on the Binet tests, involving
large numbers of subjects, find that for half the cases over
unspecified periods of time the average changes are less than 6
points increase and less than 3 points decrease. The results reported
by Gray and Marsden (q.v.), who summarized studies
using the Stanford-Binet scale, are shown in Table 49. They find
that the central tendency of change is +2.25 points, and that the
middle 50% of all changes lie between 7.7 increase and 2.25 decrease.
Baldwin and Stecher (q.v.) used the Stanford-Binet scale
with 485 children, and report that most of the changes on retesting
were within 5 points and that correlations between test and
retest were from .72 to .94. The latter coefficient is much more
representative than the former of many others that have been
reported. Thus L. S. Rugg (q.v.) gave 114 pairs of Binet tests at
intervals of from 4 to 36 months to an unselected group of children
with I.Q.’s from 73 to 133 and found a correlation between
testing and retesting of .948. However, caution must be observed
in interpreting these high correlations. They do not mean necessarily
that the I.Q.’s have not changed, but only that individuals
tend strongly to maintain their relative positions.


TABLE 48


CHANGES IN I.Q. AMONG 315 PERSONS RETESTED OVER DIFFERENT
PERIODS UP TO SEVEN YEARS


(From Terman et al., 1917, Table 25, p. 141)


INCREASES                      DECREASES
Extent of change    N          Extent of change    N
(table entries illegible in this copy)
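
The caution regarding high retest correlations can be made concrete with a brief sketch. The quotients below are hypothetical, chosen only to illustrate the point, and `pearson_r` is an illustrative helper written out in full rather than a function from any particular package. Every subject gains 10 points on retest, yet the correlation between the two testings is perfect, because relative positions are unchanged.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical first-test quotients; every retest quotient is 10 points higher.
first = [85, 95, 100, 110, 125]
retest = [q + 10 for q in first]

r = pearson_r(first, retest)
mean_change = sum(b - a for a, b in zip(first, retest)) / len(first)

print(f"correlation between testings: {r:.3f}")
print(f"mean change in I.Q. points:   {mean_change:.1f}")
```

A correlation coefficient is thus evidence about the maintenance of rank order, and must be read together with the actual distribution of changes.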


To sum up, when a child is tested
between the ages of about 4 to 15, these being the ages at which
the scale gives the most stable results, there is a 50% probability
that his I.Q. will remain constant within 5 points or so up or
down over an unspecified period of time. This, it will be noticed,
is a very carefully qualified statement, and it opens up a number
of further problems.


TABLE 49


CHANGES IN I.Q.’S FOR VARIOUS TIME INTERVALS
(From Gray and Marsden, tabulation by Nemzek, 1933 b, Table 2, p. 144)


                                 Range of middle   Semi-inter-    Central      Interval in
Testings   N      r              50% of            quartile range tendency     years between
                                 differences       of changes     of change    testings
1 and 2    100    .89 ± .014     -2.25 to 7.66     4.95            2.25
2 and 3    55     .91 ± .016     -3.03 to 3.00     3.01            0.00
1 and 3    63     .84 ± .059     -1.00 to 7.25     4.12            3.50        2
all        218    .88 ± .036     -2.70 to 7.00     4.85            1.60        1-2
1 and 2    100    .88 ± .015     -2.25 to 7.70     5.00            2.25        1
1 and 4    371    .85 ± .011                                                   1-3
1 and 6    6716   .85 ± .008     -6.10 to 4.70     5.50           -1.30        1-5
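
The semi-interquartile figures reported by Gray and Marsden follow directly from the limits of the middle 50% of changes. A minimal sketch, assuming that those limits are the first and third quartiles of the distribution of changes, and using three pairs of limits as read from Table 49:

```python
def semi_interquartile_range(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2


# Limits of the middle 50% of I.Q. changes, as read from Table 49.
middle_fifty = {
    "1 and 2": (-2.25, 7.66),
    "2 and 3": (-3.03, 3.00),
    "1 and 3": (-1.00, 7.25),
}

for testings, (q1, q3) in middle_fifty.items():
    siq = semi_interquartile_range(q1, q3)
    print(f"testings {testings}: semi-interquartile range = {siq:.3f}")
```

The computed values agree with the tabled ones to within rounding. A figure of about 5 points therefore means that half of all changes fall within a band roughly 10 points wide, which need not be centered on zero.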

B. For one thing, if the middle 50% of changes are likely to
be within an approximate range of 5 points more or less up or
down,* this leaves 50% of changes that are more extensive. How
large are they likely to be? A considerable number of quite extensive
ones are to be expected. Thus Psyche Cattell (1937), in connection
with the retesting of 1,300 children with the Stanford-Binet
scale, involving 3,331 comparisons in all, reports that 4
individuals, or .3% of the group, made changes of over 40 points,
that 1% of the group gained 30 points or over, that 5% gained
20 points or over, that 10% gained 15 points or over, and that
25% gained 8 points or over. Nemzek (1933 a) and Robert Thorndike
(1940) also report a considerable number of large changes
in I.Q., as will be seen from Tables 50 and 51. Two of Thorndike’s
cases made the sensational gain of 50 points, and one of them
the almost equally sensational loss of 45 points, while gains reported
by Nemzek ranged to 32 points and losses to 22 points.


* The reader should realize that this specific figure is a rough one. The obtained 
limits of the middle 50% for various studies have already been reported. 


TABLE 50


I.Q. CHANGES IN 52 CHILDREN RETESTED AFTER 2 YEARS


(Nemzek, 1933 a, Tables 2, 3, p. 476)


                             GAINS         LOSSES
STANFORD-BINET    Range      2-32 pts.     1-21 pts.
                  Median     8.93          6.00
                  Mean       10.82         6.22
                  Number     33            18
HERRING-BINET     Range      1-22          2-19
                  Median     10.63         6.00
                  Mean       10.66         7.72
                  Number     33            18


TABLE 51


CHANGES IN I.Q. AMONG 1,100 PRIVATE SCHOOL CHILDREN AFTER
2½ YEARS


(After Robert Thorndike, 1940)


      INCREASE                    DECREASE
Points    N                 Points    N
50        2
45        3                 45        1
40        [illegible]       40        1
35        11                35        6
30        15                30        4
25        45                25        8
20        60                20        [illegible]


The changes recently reported by Hirt (q.v.) are less extensive.
Tests and retests with the 1916 Stanford-Binet were run on 1357
cases, at varying but substantial time intervals. Over 46% varied
less than 6 points, almost 75% less than 11 points, less than 10%
more than 15 points. The most striking change was a drop of 38
points, from 68 to 30, between C.A. 8-3 and 17-10, this accompanying
a case history of epilepsy secondary to organic lesion.
More generally, Hildreth (1943) finds that the average retest
scores on the Stanford-Binet of children of 130 I.Q. or more tend
to run higher. And Mildred Allen (1944, 1945) finds that Kuhlmann-Binet
testings in grade 1 are not closely related either to
academic achievement or indices of intelligence in grade 4.

Three comments are in order in regard to these extensive 
changes and their frequency. (a) The facts seem to call for some 
revision of Terman’s earlier opinion that I.Q. increases of 20 
points or more were to be expected only in 1 or 2 cases per 1,000. 
Apparently they occur much more often and go far beyond 20 
points. However, his claim that half of all changes are likely to 
be between about 5 or 6 points advance and 4 or 5 points decrease 
still stands. (b) Such changes in no way invalidate the I.Q. as a
unit of measurement, any more than a balance is invalidated
when it records a change of weight. The question of the validity
or, better, the stability of the I.Q. as a unit of measurement depends
on quite other considerations. (c) It must not be supposed
that because large changes are very much less common than small
ones, they are therefore unimportant and to be disregarded. In
science it constantly happens that an unexpected deviation that 
only occurs once in thousands of times opens up a whole new area 
of investigation and explanation. The relatively high frequency 
of small I.Q. changes is, of course, an important datum. But large 
changes, although comparatively infrequent, are data just as im- 
portant. No account of the constancy problem which disregards 
them can possibly be well founded. 

C. The constancy of the I.Q. is regularly affected by certain 
conditions associated with the administration and statistical char- 
acter of the tests themselves. 

(a) The variability of the I.Q. increases with the increase of
time between testing and retesting. Robert Thorndike (1933), in
an elaborate statistical study on the effect of time interval on
Stanford-Binet I.Q.’s, concludes that a correlation of .889 is to
be expected between quotients based on testings following one
another immediately, a correlation of .814 when the interval is
30 months, and a correlation of .698 when the interval is 60
months. R. R. Brown (1933 b) finds that a similar but somewhat
smaller drop in constancy is to be expected. The correlation
obtained with a time interval of 1 year is .86, and with an
interval of 9 years is .78. An average change of I.Q. points
over a year is ± 5.36, and over 9 years ± 9.34. Brown (1933 a),
in another study based on Stanford-Binet I.Q.'s of 124 problem
children, finds that mean changes over periods from 60 to 145
months are twice as great as those over periods of 1 to 24 months.

(b) It has been reported from time to time that high I.Q.'s tend
to rise and low ones to fall. Thus Psyche Cattell (1931) established 
this trend in the testing with the Stanford-Binet scale of 1,183 
children, which was repeated at least twice with each child, at 
time intervals up to 72 months. For time intervals of longer than 
6 months, which meant the virtual elimination of practice effect, 
those in her highest I.Q. category made a mean gain of 16.0 points, 
and those in her lowest category made a mean loss of 7.5 points. 
Also, Goodenough (1928 c), using the Kuhlmann-Binet scale in a 
study of 380 young children, found consistent increases in mean 
and to the unreliability of tests for young children. This last
factor, however, cannot entirely explain changes trending strongly 
in one direction, such as those reported in connection with the 
effects of preschool attendance. 

D. This final comment opens up a question which may well 
have been already in the reader’s mind. What of changes in the 
I.Q., either positive or negative, associated with some special set 
of circumstances such as those considered in the last chapter— 
favored or unfavored socioeconomic status, special educational 
stimulation, continuation in school, good or bad home conditions, 
and the like? The point to understand is this: All such special 
changes are significant only as deviations beyond the range of 
expected constancy. If the intelligence quotient were not constant 
within the limits and with the qualifications already set forth, 
fluctuations and changes associated with this or that particular 
condition would be in no way striking and would suggest no prob- 
lems and call for no explanations. 

E. The general conclusion, then, must be that the massive data 
in regard to the intelligence quotient indicate a considerable but 
by no means absolute constancy of mentality. They indicate the 
limits within which changes are likely to take place. And when 
greater changes seem to occur, a problem immediately arises. 


3. Other psychometric evidence 


Outside of the studies on the constancy of the I.Q., the psycho- 
metric evidence on the constancy of mental traits and abilities is 
rather scanty and not very precise. However, it is without doubt 
confirmatory. Thus Hollingworth and Kaunitz (q.v.) found that
82% of a group of 116 children in the top centile in I.Q. of all
those tested on the Stanford-Binet scale were in the top centile
10 years later when measured with the I.E.R. Intelligence
Scale CAVD and Army Intelligence Examination Alpha. Lamson
(1930), too, has strikingly shown the maintenance of intellectual 
status. She studied 56 gifted pupils who were in special oppor- 
tunity classes in elementary school, following them through high 
school, and comparing each child with a paired control from the 
same elementary school and of the same sex and grade classifica- 
tion. In regard to the determination of their intelligence, they 
were tested three times with the Stanford-Binet scale in elemen- 
tary school, and twice with Army Alpha in high school. The mean 
I.Q. of the gifted group was 155 and the range was from 135 to 
t testing with Army Alpha was at a mean chrono-
I Io years Iz months. At this age 33% of them scored
in the “A” classification, which as may be seen from Table gq
means a test performance equivalent to that of the top 5.14% of
the white draft in World War I. The second Alpha testing was
at the age of 15, and on this occasion all of them scored in the
Table 52. No other group there listed scores 100% A. The com- 
parison between these children and the group of graduate students 
is particularly striking. This study of Lamson’s is a typical fol- 
low-up investigation, and other work of the same kind confirms 
the finding that mental status tends to be retained. 


TABLE 52


PERFORMANCE OF GIFTED HIGH SCHOOL GROUP ON ARMY ALPHA
COMPARED TO THAT OF OTHER GROUPS


(Lamson, 1930, Table 5)


                                   Percent getting
Group measured                     “A” Alpha scores    N       Mean C.A.
Hollywood High School Seniors      34.1                211
High School Seniors                37.7                635     17.4
University Students                51.5                5950
Library Personnel                  60.0                296
Oberlin College Freshmen           70.0                330
Graduate Students                  81.0                252
Hotchkiss Seniors                  81.0                75
Yale Freshmen                      85.5                400
Gifted Group                       100.0               54      15.0


seem usually to amount to stronger feelings of liking for what is
earlier liked very well, and decreasing inclination for what is not 
liked very well in earlier years. 


4. Evidence from beyond psychometrics 


So far as the present writer’s knowledge goes, the general prob- 
lem of the constancy of mental traits and abilities has not received 
much attention in psychology outside the field of psychometrics. 
There are at least three apparent reasons which would make this 
understandable. First, it is natural and obvious to assume that 
the general mental make-up of any individual will remain about 
the same throughout his life. This corresponds to ordinary experi- 
ence, and the assumption has been accepted without elaborate 
investigative proof. Second, in most fields of psychological re- 
search the problem of constancy is not paramount. It only becomes 
so when we are trying to construct measuring instruments and to 
devise criteria which can be applied to an individual at various 
ages. Third, radical and striking changes in the personality and 
mentality of human beings have not forced themselves upon the 
attention of psychologists, presumably because they do not often 
occur. If they were common, it is safe to say that they would 
have been made the subject of investigation long before today. 

Of course, the whole psychology of learning bears upon the sub- 
ject. It is known that (a) human beings can certainly learn a 
great many more things than they do learn, (b) the effects of 
learning can be lasting rather than superficial and transient, (c) 
the capacity to acquire new abilities continues until late in life. 
Human beings, in other words, are highly adaptive, and the actual 
extent of any individual’s adaptation depends on circumstances 
and is never fully actualized. Thus mentality is very far from 
fixed or absolutely constant. No doubt, however, learning capacity 
and adaptability have definite limits, though just what they are 
and what determines them is not well understood. 


5. Conclusion

All the evidence on this problem hangs together quite consist- 
ently. The findings on the constancy of the intelligence quotient 
are quite in line with other psychometric evidence and with the 
general trend of psychological thought and investigation. Those 
findings have dealt with the problem in specific quantitative terms. 
They indicate that in the majority of cases, and under ordinary
circumstances, the mentality of human beings may vary some- 
what but not a great deal, that variability depends upon the length 
of elapsed time, the nature of the tests used, and the age at which 
mental status is determined, but that in a significant and not 
inconsiderable number of cases large though not unlimited varia- 
tions can take place. In all probability this corresponds reasonably 
well to the true facts. It in no way prejudges the issue of whether 
mentality can be changed by properly chosen influences, favorable 
or unfavorable, consciously brought to bear. Again, in all proba- 
bility this can be done, in view of our total knowledge of the 
psychology of learning and of the phenomena of mental growth,
which is the next major topic to consider here. But once more, 
the possibilities of such change are far from limitless. 

This seems to be a reasonable position, in the light of all the 
evidence. It in no way undermines the foundations of psycho- 
metrics, for it provides a quite sufficient basis for the construction 
of significant mental tests. Those tests do not measure phenomena 
that are invariant or absolutely fixed, or indeed nearly so. But 
such a supposition would surely be intrinsically untenable, con- 
sidering that we are dealing with the living, changing, adaptive 
human being. However, human nature is by no means so fluctuating 
and uncertain as to render all attempts at measurement and 
prediction futile. 


MENTAL GROWTH 


The topic of mental growth is in a sense complementary to 
that of constancy, for it has to do with the sequential changes 
that take place in a person’s mentality and behavior patterns 
during the course of his life. Those changes are due partly to 
organic maturation, and partly to experience and learning. The 
specific relevance of this topic to psychometrics lies in the fact 
that many tests and measures offer interpretations based upon 
the relationship between chronological age and mental ability. Its 
broader relevance turns upon the question of whether and to 
what extent psychometric instruments can give an account of the 
changes in the way of mental development and decline within the 
constant framework of the person’s individuality. The subject 
itself is, of course, a very large one, to which a very great number 
of investigations have been devoted. Only those aspects of it 
which are directly related to psychometric problems will be 
summarized here. 


1. Early mental growth 


Early mental growth is characterized by the rapid emergence 
of new functions and differentiations. Gesell and his associates 
(v. Gesell and Amatruda), to whose work reference has already 
been made, have carried out elaborate studies that reveal some 
of these changes. Thus an infant at the age of 1 month lifts his 
head from time to time when held at the experimenter’s shoulder. 
At 8 months he sits momentarily without support. And at 21 
months he walks attended on the street. With regard to language, 
at 1 month he gives definite heed to sound. At 8 months he gives 
vocal expression to recognition. At 21 months he repeats things 
that are said. With regard to what Gesell calls adaptive behavior, 
at 1 month he stares at a massive object presented in his field of 
vision. At 8 months he definitely looks for an object fallen on the 
floor. At 21 months he differentiates between a toy tower and a 
toy bridge. These are a few sample items taken from Gesell’s 
developmental schedules, and from them the reader can gain some 
impression of the complex and varied differentiations and inte- 
grations referred to by the term mental growth. 

Such studies are of great importance. They support and amplify 
our general notions as to the nature of the growth process, and 
they reveal its specific phenomena. But they also make it clear 
that the growth concept is one that involves very serious psycho- 
metric difficulties. 

A. In the first place, the term growth evidently includes the 
evolution of more or less separable functions which have their 
own developmental rhythms. Thus it is necessary to distinguish 
between motor development and linguistic development. As Bayley 
(1933) and others have shown, motor development proceeds more 
rapidly and decisively during the early months of life than linguistic 
development. Also, it is necessary to consider as a more 
or less separate category what Gesell calls adaptive behavior. 

But this is by no means all. Each of these broad divisions 
probably contains within itself an indeterminate number of 
subdivisions that are just as real and important. Is it, for instance, 
legitimate to assume that the child’s linguistic reactions as de- 
scribed by Gesell at the age of 1 month are psychologically con- 
tinuous and homogeneous with those at 21 months? What is 
lumped together as language is really a very complex constellation 
of functions. Thus Lewis (q.v.) in his study of infant speech 
shows that a few hours after birth comfort and discomfort sounds 
can be discriminated. At 2 to 3 months there appear babblings, 
i.e., comfort sounds pleasant to make for their own sake, which 
according to the interesting suggestion of Lewis may be the origin 
of aesthetic interest in speech and sound. Between 1 and 4 months 
there is much rough imitation of adult speech sounds, but after 
4 months imitation seems to become much rarer, because meaning 
begins to be prepotent. After 6 months, however, imitation reappears 
accompanied by echolalia. Then there is a rapid accumulation 
of phonetic forms and new concepts. Stumpf (q.v.), again, 
finds that early vocalization is neither speech nor song, but the 
matrix of both, speech moving in the direction of increasing control 
by symbolic meaning, and song in the direction of increasing 
control by pitch. So, too, Gesell’s adaptive behavior seems hardly 
more than a topical category covering many real differences. 

A good diagnostic and predictive account of a child's early 
growth will give great weight to the relationships manifested in 
the development of these discriminable functions. General retardation 
is an unfavorable sign. The precedence of speech to walking 
is usually a very favorable one, and so on (v. Gesell, 1940). But 
it becomes very much of a question whether a global over-all 
index such as mental age can ever represent the significant phenomena 
of early growth. Such a measure or index can only be an 
average, concealing and containing within itself many important 
differences. This is a major problem in the construction and inter- 
pretation of mental tests for young children; and, as we have 
seen, criticism of the use of global indices extends also to their 
use far beyond infant levels. 

B. It is usually believed that mental [...] 
and years of life is very rapid. Certain[...] 
who watches a growing infant. And [...] 
general phenomena of organic growth. [...] 
is often made. Thus Terman and his [...] 
for telling just how rapid early growt[...] 
growth. Indeed, some growth curves [...] 
Gesell (1929) and Bayley (1933), [...] 
tion, i.e., a slow initial advance that [...] 
plotted in such a way as to make this inevitable. Rapid early 
mental growth is probably the reality, but exact quantification is 
lacking. 

C. It has often been claimed that early growth is peculiarly 
subject to environmental influence. Again, this is quite probably 
true in some important sense, but we do not know with any 
exactitude in what sense. Older persons can certainly learn and 
change. But perhaps the changes cannot be as profound in some 
way or other as those that can take place early in life. Psychoanalysis 
assumes the prepotency of early influences upon development, 
and no doubt with justice. The argument has been used to 
explain the remarkable effects claimed for preschool attendance, 
and it is not implausible. But we have no conclusive proof of its 
truth and no quantitative definition of its meaning. 


2. The continuation and culmination of mental growth 


Two problems arise here. The first is the significance of the 
so-called age of arrest. The second is the nature and reality of 
mental growth continuing into the later adult years. 

A. The age of arrest is primarily a psychometric rather than 
a general developmental concept. It is important that this should 
be clearly understood. It means the chronological level at which 
the regular mean improvement of test scores with increasing age 
ceases to manifest itself. The issue is an old one. Terman (1917), 
with his standardization group for the upper levels of the Stanford 
Revision of the Binet scale, which consisted of “normal adults,” 
i.e., businessmen and older high school pupils, found that this took 
place at about 16 years. So this was recommended as the age of 
arrest, to be used as the denominator in computing the intelligence 
quotients of older persons. However, Yerkes (1927) found that a 
representative sample of the native white draft in World War I did 
not go beyond a mean mental age of 13.42 years on the Stanford- 
Binet scale. This led him to argue that Terman had placed the age 
of arrest too high because of special selection in the standardization 
group he used. Adult I.Q.’s, it was suggested, should be computed 
on a base line of about 14 C.A., which of course would raise them 
very materially. Terman himself (1927) has attacked this conten- 
tion, saying that the Army tests were unsuitable, that they were 
administered to bewildered and disoriented men, and so forth. 
The problem has been found very challenging, and numerous 
investigators have devoted attention to it. Thus Thorndike (1923, 
[...]) [...]rly intervals to large groups of subjects 
[...] and found advancing scores up to the 
latter age. Teagarden (1924), again, gave four intelligence tests to 
408 subjects from 12 to 20 years old, and found a steady increase 
in scores up to 18, with no reason to suppose that for normal 
individuals mental advance was arrested even then. 
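
The dispute over the age of arrest can be made concrete with a little arithmetic. The following is only a sketch, assuming the ordinary ratio form of the quotient (100 times M.A. divided by C.A.), with the chronological age of an adult replaced by the age of arrest. The function name and the 30-year-old subject are hypothetical; the mental age of 13.42 is the Army mean cited above.

```python
def intelligence_quotient(mental_age, chronological_age, age_of_arrest=16.0):
    """Ratio I.Q.: 100 * M.A. / C.A., with C.A. capped at the age of arrest.

    For subjects older than the age of arrest, the age of arrest itself
    is used as the denominator, as described in the text.
    """
    divisor = min(chronological_age, age_of_arrest)
    return 100.0 * mental_age / divisor


# Hypothetical adult of 30 with the mean mental age reported for the draft.
army_mean_ma = 13.42
iq_base_16 = intelligence_quotient(army_mean_ma, 30, age_of_arrest=16.0)
iq_base_14 = intelligence_quotient(army_mean_ma, 30, age_of_arrest=14.0)
print(f"base of 16: {iq_base_16:.1f}   base of 14: {iq_base_14:.1f}")
```

On a base of 16 the average draftee falls well below 100, while on a base of 14 he comes much closer to it, which is the sense in which the lower base line "would raise them very materially."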


[...] of further work the 
explanation clearly is that the battery dealt with functions which 
show this type of developmental sequence. 

Attempts have been made with some success to show what 
mental functions are affected by age. Thus Jones and Conrad 
(q.v.) gave Army Group Intelligence Examination Alpha to 1,191 
residents of New England villages, ranging in age from 10 to 60. 
Average scores based on all 10 subtests showed a peak at 19, and 
then a smooth drop. But the effect of the subtests was differential. 
The Common Sense and Analogies subtests showed the sharpest 
drop. Arithmetical Problems, Number Series Completion, and 
Scrambled Sentences showed some. Opposites and Information 
showed none at all. 

Willoughby (q.v.) obtained mean scores on 11 tests for age 
groups up to 60. All the averages showed a rise into the late teens 
and early twenties. Then there was a decline for those involving 
Series Completion, Verbal Analogies, Verbal Opposites, Substitu- 
tion Learning (digit-symbol), and Information in History and 
Literature. Arithmetical Reasoning, however, showed no decline. 

Sorenson (q.v.) gave tests in vocabulary and paragraph reading 
to between five and six thousand adults. Of this total population 
he selected 641 to provide uniform groupings at five-year intervals 
from 16 to 70. The members of each of these five-year groups 
were selected to correspond to the group 50-54 years of age in 
years of schooling and occupational status. The great importance 
in holding these two factors constant in dealing with adults, if 
comparisons of test performance are to be meaningful, is very 
evident. If, for instance, our adult groups below 30 are strikingly 
superior in education and socioeconomic status to those above 40, 
an apparent decline of intelligence is extremely probable. With 
these two factors held constant, Sorenson found that mean vocabu- 
lary scores improved throughout the age range, and that para- 
graph reading was maintained at an even level. He argues that 
alleged and demonstrated declines in mean test scores at later 
levels are largely due to subjects getting more and more out of 
practice with test materials as they grow older. 

Weissenburg, Roe, and McBride (q.v.), too, have made an 
elaborate study of the problem, some of the details of which are 
presented in Table 53. Their work shows very clearly that tests 
differ in suitability for adults, sentence completion apparently 
being a good one, the later Stanford-Binet subtests standing up 
well, vocabulary also being a good one. Performance tests, on the 
other hand, are generally poor adult material. So is the Goodenough 
Drawing Scale. On the whole, according to their results, 
language ability is well maintained with age, but other functions 
tend to show a peak and a decline. 


Christian and Paterson (q.v.) gave a vocabulary test of 120 
items to a group of university freshmen and to another group 
made up of their relatives and friends. The differences which are 
shown in Table 54, slightly in favor of the younger group, are 
not statistically significant. The authors also point out that the 
younger group was somewhat more highly selected for intelligence. 
Furthermore, the test involved a speed factor, and when 
this was cancelled out the slight advantage of the student group 
disappeared. As a piece of research this is somewhat slight. But 
in conjunction with much confirmatory evidence its emphasis upon 
the vocabulary test as suitable for adults is significant and indicative. 


TABLE 54 
RECOGNITION VOCABULARY SCORES OF VARIOUS AGE GROUPS 

(Christian and Paterson, p. 168) 

Age Group            N    Median   S.D.   Range 
18 year olds        200     88      11    41-115 
40-49 year olds      50     84      22    40-116 
50-59 year olds      40     82      26    25-115 
60-69 year olds      30     79      29    10-114 

W. R. Miles (q.v.) presents a general report on the Stanford 
Later Maturity Study. The work in 1930 was based on 863 persons 
aged from 6 to 95. The work in 1932 was based on 1,600 cases. Age 
groupings as follows were set up: B, 10-17; C, 18-29; D, 30-49; 
E, 50-69; F, 70-89. The averages show definite general declines 
for the older age groups. On a maze test the scores for the five 
groups were respectively 95, 100, 92, 83, 55. When a single rather 
experimental test intended to measure imaginative capacity was 
given, there was virtually no change in the mean scores for the 
various groups. Tests requiring the higher types of intellectual 
effort show in general a late maturity and a slow decline. Thus on 
a learning test the mean scores for the five groups were respec- 
tively 72, 100, 100, 87, 69. Comparison and abstraction as meas- 
ured by the ordinary types of test items show a peak at about 18, 
and then a decline. But according to Miles the function itself is 
[...] measured, which is extremely probable. All the averages 
[...] to be interpreted in the light of the fact that [...] 
were large at all age levels, and that there was much 
overlapping. To give an idea of how extensive this was, 25% of 
the oldest group equaled or excelled the over-all adult average, 
even with speed as a factor. 


Figure 27 reproduces the curve of mental decline developed by 
[...] memory loss in old age (Gilbert, 1941). But reasoning ability and 
language functions may advance for many years and show only 
a very late decline (Gilbert, 1935). Moreover, it is clear that for 
any stable and dependable conclusions, the selection of older persons 
is of paramount importance, for cumulative differences in 
occupation, mode of life, general stimulation, and alertness can 
easily falsify a set of test findings. Furthermore, in dealing with 
adult mentality, motivation is of great importance. It has been 
shown that among eminent persons the greatest creative output in 
literature and science is between the ages of 25 and 40. But as 
Lehman (q.v.), on whose work the above statement is based, 
points out, these are the years when such persons are establishing 
their reputations and working hard to build a career. The infer- 
ence for psychometrics simply is that a test situation which be- 
cause of its setting and content is found ridiculous, or childish, or 
trivial by adult subjects, is almost certain greatly to misrepresent 
the truth as to their mental abilities. 

It is evident that this material is full of meaning for psychometrics. 
It is entirely possible to construct effective adult 
mental tests. They must consist of suitable content. But what 
such content needs to be is fairly clear. And such tests must not 
yield indices depending on very fine and small age classifications. 
On the other hand, all that has been said makes it clear why tests 
constructed and standardized primarily for young people in school 
may well show an early age of arrest, and why they tend to suggest 
a false picture of mental growth beyond the late teens. 


3. The growth curve 


For psychometrics the most fundamental question in this whole 
area is whether it is possible to construct a curve representing the 
true average or “normal” course of mental growth over a con- 
siderable age range. If this could be done, it would be of the 
highest importance. Test performance could be interpreted by 
indices and scores that would show known increments and levels 
of mental growth, instead of by mental age scores or standard 
scores whose developmental values are uncertain. It would be possible 
to tell whether a person's mental development was proceeding 
more or less rapidly than normal, and whether he had reached a 
stage of maturity at, or above, or below the expected norm for his 
age. This very thing, as we have seen, has been attempted by 
Kuhlmann, who accepted the curve developed by Heinis (q.v.) 
as an authentic chart of normal mental development. 


FIG. 28. THREE GROWTH CURVES FOR THE SAME TEST (NATIONAL 
INTELLIGENCE TEST). A. on raw scores; B. on S.D. scores; C. on S.D. 
scores from absolute zero. Freeman (1929), Fig. 10; Odom, Fig. 6; 
Thurstone (1928), Fig. 8. 

It is easy enough to run a test or a battery of tests at a number 
of different ages, and then to plot a curve through the mean scores 
obtained for those ages. But to accept such a curve as a true 
representation of mental growth is quite another matter. Three 
curves so derived are shown in Figure 28. They are all for the 
National Intelligence Test—and they are all different. The chief 
reason is that they are all three based on different statistical 
treatments of the test data. Freeman’s curve shows changes in 
average raw scores (v. Freeman, 1927). His procedure in this 
respect has been criticized by Odom (q.v.) and J. Peterson (1922). 
The objection is that equal differences in raw scores may not 
correspond to equal differences in performance. Thus it may be 
much easier to raise a score from 50 to 60 than from 120 to 130, 
and so on. 

In order to meet this difficulty, Odom took the mean score of 
[...] the point of origin for his curve, and the 
[...] as the unit of measurement. The 
raw scores of all other age groups were converted into this unit 
by the method already explained, the basic idea being that the 
increment of performance or difficulty expressed by any given 
[...] 
groups were worked out in terms of these derived scores, and the 
curve plotted as shown. 
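
The general shape of such a conversion can be sketched in a few lines. What follows is an illustration only, and not Odom's actual computation. It assumes that the point of origin is the mean of some chosen reference group and that the unit is the standard deviation of that same group. The function name and all figures are hypothetical.

```python
from statistics import mean, pstdev


def to_sd_units(age_means, reference_scores):
    """Express each age group's mean raw score as a distance from the
    reference group's mean, in units of the reference group's S.D."""
    origin = mean(reference_scores)  # assumed point of origin
    unit = pstdev(reference_scores)  # assumed unit of measurement
    return {age: (m - origin) / unit for age, m in age_means.items()}


# Hypothetical figures, not taken from the National Intelligence Test.
reference_scores = [40, 60]          # raw scores of the reference group
age_means = {8: 30, 10: 50, 12: 80}  # mean raw score at each age
curve = to_sd_units(age_means, reference_scores)
for age, value in curve.items():
    print(age, value)
```

A curve plotted through such derived values will generally differ in shape from one plotted through the raw means, since the choice of reference group fixes both the zero and the size of the step.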


[...] ideal conditions including perfect test reli[...] 
assumed that the failure of an a[...] 

[...] on the same test, and all 
different, authentically represents the c[...] 
ment? This, clearly, is the question. [...] 
make it even more cogent. The ordinar[...] 
endorsed by Terman and his associates, is th[...] 
scale shows a negatively accelerated pattern [...] 
That is, it shows a development that is at first rapid and then 
slows down, so the earlier mental age unit steps are really further 
apart than the later ones. But Thurstone and Ackerson (q.v.), 
applying to it the method described above, derived a curve showing 
positive acceleration up to the age of 10, i.e., showing a development 
which begins slowly and then speeds up. So, again, 
which picture is correct? 

The answer is that we cannot say. Statistical ambiguities make this 
inevitable. To plot a curve on raw test scores as Freeman did is 
certainly open to objection, because equal differences between raw 
scores almost certainly do not stand for equal differences in test 
performance. Moreover, this method determines no zero point for 
an origin. But the rectifications by Odom and Thurstone, while 
they may be statistically justifiable, involve so many formal 
assumptions that the psychological meaning of the outcome is 
impossible to determine. For this reason the position taken by 
Terman seems entirely justifiable when he says that nothing certain 
is known about the curve of mental growth, and that it cannot 
be used as a psychometric tool. 

These difficulties are what might be called formal or logical. 
But there are psychological difficulties too, which are even more 
important. On all the evidence, mental growth is anything but a 
simple unitary process. On the contrary, as we have seen, it is complex, 
diversified, composed of shifting and interlocking rhythms, 
characterized by the sudden emergence of new functions and the 
long delay of others. A very sound argument can be made for 
denying that it is a linear process at all. If this is so, it cannot be 
represented by any single curve, no matter what devices of statistical 
analysis are used. Indeed, the consequence would be that 
the better analysis became, the further it would move from the 
sort of simple representation shown above. 


4. Conclusion 

This contention is fraught with significance for psychometrics. 
It means that scores such as the I.Q., the M.A., or standard scores 
are simply convenient statistics, meaningful because they work 
and because they possess an indubitable psychological content. 
But they are not adequate indices of a unitary mental develop- 
ment, and cannot be converted into such indices. Also, this points 
the way for the future development of mental testing. The suggestion 
is that it should move towards greater differentiation and 
the construction of measuring instruments of more diversified 
kinds and for more diversified purposes, capable of dealing better 
with the immense variety of human mentality and development. 
The attempt to build tests and derive norms about a unitary 
linear sequence of mental development seems foredoomed because 
in all probability the hypothesized phenomenon does not exist. 


[...]ical case. [...] 
Army Alpha. From 
these three raw scores alone nothing is known except their rank 
order. Whether they are high, medium, or low in terms of test 
performance, and whether 75 is as far above 50 as 100 is above 
75 in terms of test performance, are not known. But if the mean 
and the standard deviation of the standardization group are computed, 
and turn out to be 75 and 25 respectively, then 75 is average 
test performance, and the three scores become 1, 0, and -1 in 
[...] Here [...] considerably 
[...] than between 0 and -1. 

[...] this inequality, although 
inconvenient, is not important. The reason is [...] 
[...] distributions are chosen. But this group is used as a sample 
of the larger population to which the test is intended to 
apply. The real question is whether the two raw score differences 
indicate equal real differences in test performance or difficulty for 
the entire population and not for the sample only. But here the 
form of distribution of scores can never be directly ascertained 
unless the whole population takes the test, which is impossible. So 
it is necessary to make an assumption, this being that the distribution 
is normal. Before considering how this assumption is justified 
three comments must be made. 


A. The normal curve is not any bell-shaped curve. It is the 
curve of the equation 

\[ y = \frac{N}{\sigma\sqrt{2\pi}} \, e^{-x^{2}/(2\sigma^{2})} \] 

where x is the abscissa, y the ordinate, π and e are constants, 
σ the S.D. of the distribution, and N the number of cases. Glib 
talk about the normal curve, as if it could be identified at a 
glance, is quite fallacious. 


B. There is no universal law of nature that events must distribute 
themselves or tend to distribute themselves in a normal 
[...] p. 168) put it: “A variable which is the 
resultant of several equally potent cau[...] 
[...]ences which disturb pu[...] 
in some subject in sch[...] 
it is the deliberate pu[...] 
extraneous to chance. [...] 
certain magnitude will occur. In both cases this is because the 
phenomena are subject to the probability distribution. So, in the 
hypothetical instance from mental testing that was discussed, we 
can say exactly what the probability is of a person making a 
score of 50 or 100 if we know that these scores are 1 S.D. below 
and 1 S.D. above the mean, and if we also know that the distribu- 
tion of the test ability in the functional population is normal. This 
is the assumption constantly made in working out interpretive 
norms for mental tests. Its basic importance for psychometrics is 
obvious. What, then, is the foundation for it? 
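
The hypothetical instance may be worked through numerically. The sketch below assumes, as the text does, a mean of 75 and an S.D. of 25, and a normal distribution in the population. The function names are illustrative only. The ordinate follows the equation of the normal curve given under comment A, with x measured from the mean, and the proportion below a given standard score is obtained from the error function.

```python
import math


def standard_score(raw, mean, sd):
    """Distance of a raw score from the mean, in S.D. units."""
    return (raw - mean) / sd


def normal_ordinate(x, sd, n):
    """Height of the normal curve: y = N / (sd * sqrt(2*pi)) * exp(-x^2 / (2*sd^2)),
    where x is the deviation from the mean."""
    return n / (sd * math.sqrt(2 * math.pi)) * math.exp(-x ** 2 / (2 * sd ** 2))


def proportion_below(z):
    """Proportion of a normal population falling below standard score z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


for raw in (50, 75, 100):
    z = standard_score(raw, mean=75, sd=25)
    print(raw, z, f"{proportion_below(z):.3f}")
```

Only if the assumption of normality holds do these proportions mean what they appear to mean. The arithmetic itself is indifferent to whether the assumption is true.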


1. Reasons for and against the assumption of normality 


The assumption that mental traits are normally distributed 
rests upon what is essentially a circumstantial argument, for it 
cannot be directly verified. That argument has been formulated 
with great clarity and explicitness by Thorndike (v. Thorndike 
and others). It will first be summarized, and then objections to 
it will be considered. 

A. The argument runs as follows. 

(a) Scores on many tests given to large populations distribute 
themselves in bell-shaped symmetrical curves which, although not 
precisely normal, are nearly so. The approximation to normality 
is close enough so that the values computed from these distributions 
will not contain serious errors. A sample of such distributions 
is shown in Figure 30. 

(b) If the raw scores on these tests are converted into scores 
based on the standard deviations in order to equalize the units, 
the approximation to normality becomes still greater. Such distri- 
butions of converted or transformed scores are shown in Figure 31. 

(c) It might be said of some given test that it was constructed 
and its items arranged so as to produce an approximately normal 
distribution. If so, the appearance of such a distribution would 
prove nothing about the distribution of the ability it purports to 
measure. It would be artificially produced by manipulation. But 
Thorndike finds such distributions in the scores not of one but of 
many tests, and not at one but at two age levels—specifically the 
6th and 9th grades. Furthermore, composite scores on all these 
tests together can be obtained after the units have been equalized. 
These composites also show an approximately normal distribu- 
tion, and they show it at three age levels, for the 6th grade, the 
9th grade, and college freshmen. 

(d) It is on these facts that Thorndike rests his case. Whatever 
the arguments pro and con, approximately normal distributions 
of raw and derived scores do persistently appear for many different 
tests and on many age levels. 
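
What it means for a distribution to be "nearly" normal can be given a rough numerical form. The sketch below is not Thorndike's procedure. It merely assumes two simple checks: the skewness of the scores, which is zero for a symmetrical distribution, and the proportion of cases lying within one S.D. of the mean, which for a normal distribution is about 68 per cent. The scores are hypothetical.

```python
import math
from statistics import mean, pstdev


def skewness(scores):
    """Mean cubed standard score; zero for a symmetrical distribution."""
    m, s = mean(scores), pstdev(scores)
    return sum(((x - m) / s) ** 3 for x in scores) / len(scores)


def proportion_within(scores, k=1.0):
    """Observed proportion of scores within k S.D. of the mean."""
    m, s = mean(scores), pstdev(scores)
    return sum(abs(x - m) <= k * s for x in scores) / len(scores)


def expected_within(k=1.0):
    """Proportion within k S.D. of the mean under a normal distribution."""
    return math.erf(k / math.sqrt(2))


# Hypothetical, deliberately symmetrical set of scores.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(f"skewness: {skewness(scores):.3f}")
print(f"within 1 S.D.: observed {proportion_within(scores):.3f}, "
      f"normal {expected_within():.3f}")
```

A small sample such as this one can be perfectly symmetrical and still depart from the normal proportions, which illustrates why symmetry and a bell shape alone do not establish normality.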

B. Various criticisms of the logic of this argument have been 
made by Thomas (q.v.) and by McNemar (1942). Their statements 
represent a considerable body of opinion. 


(a) As to the use of raw scores, it is pointed out that they may 
[...] 

[...] fact that the distribution of the original raw scores is in fact 
approximately normal. 

Thus the argument may be circumstantial, but it is not falla- 
cious. Thorndike has been charged by Thomas and McNemar 
with stacking the deck by first forcing a normal distribution on 
his data through conversion into standard deviation units, and 
then discovering a normal distribution. This is simply not true. 
And to the difficulty that the raw scores may represent unequal 
values, which is possibly so, he counters with ascertained facts 
whose impressive persistence cannot be gainsaid. 

C. Quite apart from such purely logical or systematic objec- 
tions to the assumption of normality, various difficulties have been 
found on what may be broadly considered psychological grounds. 

(a) Kuhlmann (1939), among others, has pointed out that 
human mating is by no means nonselective, i.e., a matter of pure 
chance. This, he contends, is a reason for believing that human 
mental traits, which are dependent to some extent, and perhaps 
considerably, upon heredity, may not be normally distributed. 
The fact itself, no doubt, is true. Human mating is certainly con- 
ditioned by socioeconomic status, for example. And it is known 
that the degree of resemblance between husbands and wives in 
the matter of intelligence is about the same as that found be- 
tween fraternally related persons, i.e., brothers and sisters. But 
its effect upon heredity is not at all clear. Adult standing height, 
for example, is normally distributed, although it seems to be 
largely an hereditary trait. So there seems no indubitable reason 
why selective mating should disturb the normality of the distribu- 
tion of mental traits. 

(b) Symonds (1923) undertook a revision of what he calls the 
“first approximation of the curve of distribution of intelligence.” 
This is the approximately normal curve of distribution of intelligence 
quotients as reported by Terman. In making his revision, 
Symonds took the data on occupational intelligence levels as 
worked out and tabulated by Fryer (1922) from the material on 
the Army testing, to which reference has already been made. He 
then plotted the distribution of intelligence for the nine major 
occupational groups set up in the 1910 census, and combined 
them into a composite curve. This curve he showed to be heavily 
skewed towards the low end, that is, to indicate a great many 
more persons on the lower as contrasted with the higher intelligence 
levels. The reason is that there are far more persons in the 
occupations of relatively low status with regard to intelligence. 


FIG. 31. THREE APPROXIMATELY NORMAL DISTRIBUTIONS OF DERIVED 
SCORES: A at sixth grade level; B at ninth grade level; C at college freshman 
level. Thorndike and others (1927), Figs. 87, 88, 122. 

There seems to be an unexplained contradiction here. The I.Q.’s 
of the standardization group of the first Stanford Revision dis- 
tribute normally. This group consisted of about 1,000 subjects and 
included all native white children in a community who were 
within a certain number of months of a birthday. The I.Q.’s of 
the standardization group of the second Stanford Revision also 
distribute normally. And this group was not only much larger, 
but was chosen to parallel the occupational subdivisions of the 
census. Yet on Fryer’s data this ought not to happen, because of 
the large surplus of persons on the lower occupational-intelligence 
levels. Until further analysis has been made, this would seem to 
raise a certain doubt. 

(c) Innate capacities are more likely to be normally distributed 
than those associated with environmental factors. Apart from any 
question of cause and effect it seems certain that mental traits 
are associated with amount and kind of schooling, socioeconomic 
status, home conditions, and so forth. This again raises a doubt 
as to the normal distribution of mental traits. 
(d) We have noted a tendency to develop what have been called 
“special purpose” tests, such as the Medical Aptitude Test or the 
Iowa Placement Examination. These, we have argued, are essentially 
mental tests slanted, standardized, and used for special functional 
groups. The contention has been that they reveal genuine 
mental traits in the special setting of the functional group rather 
than special aptitudes in the strict sense. But the assumption has 
always been that mental traits are normally distributed in an un- 
selected population. If such special purpose tests become more 
common, which is very likely, a problem will be created, for 
the traits which they measure may not be normally distributed in 
the functional populations for which they are designed. This can 
only be proved one way or the other by some such investigation 
as that of Thorndike. But if it should turn out to be the case, then
all psychometric techniques based on the assumption of normality
would be disallowed with tests of this kind.


2. Conclusion 


possibilities, and the posi 
decisively established by 


Supported though not 
that makers of many tests 


The general upshot is 
too free and easy about


HEREDITY AND ENVIRONMENT


1. The position 


There is no direct knowledge regarding the inheritance of 
human mental traits. It is a problem on which controlled direct 
investigation is virtually impossible. Such evidence as there is 
available is indirect, and a large proportion of it has already been 
discussed above in connection with other topics. 

It is established that there is a definite family resemblance in 
mental traits. This resemblance, so far as average trends are con- 
cerned, is proportional to the closeness of the blood relationship. 
Also, it tends to be transmitted from generation to generation. 
The Jonathan Edwards family, for instance, produced a long 
series of distinguished persons. Low-grade mentality and feeble- 
mindedness, too, have been found to appear generation after 
generation in the same family stock. These results are certainly 
suggestive. But they are hardly more, because the general environ- 
mental setting of a given family is probably more or less constant, 
and may well account for some part of the resemblance. It is very 
probable that innate factors are influential, but just how great 
their influence is cannot be determined. This is the more manifest 
when it is recalled that changes in domestic environment, such as 
those produced by adoption, seem to affect mentality at least to 
some extent. 

As has been shown, both the level of mentality and changes in 
that level appear to be associated with socioeconomic status, type 
of schooling, and continuation of schooling. One should be very 
hesitant about asserting the existence of a causal relationship, 
however, because the exact factors in the socioeconomic pattern 
and the educational setting that go with high mental level or the 
improvement of mentality are not clear. All that is known is that 
general favorable conditions of life, and some fairly specific favor- 
able general influences such as those of preschool attendance go 
with superior mental ability, and sometimes and for some persons, 
though not always and for all persons, are connected with an 
improvement in mental test performance. Also the converse is 
true. In fact, unfavorable general and special conditions seem 
more regularly associated with a cumulative depression of mental 
level than are favorable conditions with its cumulative improvement.

The fact that mental traits and abilities are within certain limits
constant also suggests that heredity is influential. But once more 
the connection between the inheritance of mental traits and their 
constancy is not certain, because, as Kelley (1927) has pointed 
out, various traits known to be acquired may also be very per- 
sistent and hard to alter. Moreover, the constancy of mental traits 
is only relative. It is always tempting to speak, as Cobb (q.v.) 
does, about the limits set to achievement by limited intelligence, 
and no doubt they exist. The gist of Cobb’s principal argument is
that continuation in school and school performance are rather
closely associated with intelligence as revealed by tests. But one
must remember first that the school environment is biased in
favor of those who do well on intelligence tests and against those
who do not, and second, that many persons could step up their
achievement to an undetermined degree if they were taught with
the greatest possible expertness. Limits no doubt there are, but
the idea of a hard-and-fast ceiling is a great oversimplification.

As to the relationship between race and mentality, the data are
affected by so many doubtful factors that no conclusions can be
safely drawn about what it implies for the influence of heredity
and environment.


2. Implication for Psychometrics 

A. The general issue has been formulated by Thomas (q.v.). He
[...] psychometric assumptions essentially embody an hereditarian
in[...] mentality. (a) A test [...]




Gesell and Amatruda (q.v.) find that the environmental retardation
of mentality is an undoubted fact. It is often found in young
children brought up in a large impersonal institution, even when
it is well run and managed. They list 12 recognizable symptoms
of the environmental retardation of mental growth (p. 291).
“(1) Diminished interest and reactivity, appearing about the 8th
to 12th week [of life]. (2) Reduced integration of total behavior,
about the 8th to 12th week. (3) Beginning retardation evidenced
by disparity between exploitation in supine and sitting positions, 
from the 12th to the 16th week. (4) Excessive preoccupation with 
strange persons, from the 12th to the 16th week. (5) General over- 
all retardation of function, appearing from the 24th to the 28th 
week. (6) Blandness of facial expression, from the 24th to the 
28th week. (7) Impoverished initiative, from the 24th to the 28th 
week. (8) Channelization and stereotypes of sensori-motor be- 
havior, from the 24th to the 28th week. (9) Ineptness in new social 
situations, appearing from the 44th to the 48th week. (10) Exag- 
gerated resistance to new situations, appearing from the 48th to 
the 52nd week. (11) Relative retardation in language behavior, 
appearing from the 48th to the 52nd week. (12) Definite improvement
with improved environment.” Similar effects can undoubtedly
be identified in older subjects also, such as canal-boat
children, orphanage children, and the like. A mental test which did 
not register changes of this kind, and which treated them merely as 
sources of variable error, would not be penetrating down to some 
unchanging mental substratum. It would simply be a bad test. 
And of course a good test should reflect favorable effects as well. 

Let us consider how this bears upon the three assumptions 
whose hereditarian basis is alleged by Thomas. 

(a) A test must always be built about a certain concept. This 
concept is translated into test items, and item performance is 
evaluated in terms of the performance of a standardization group. 
But it need not be understood as corresponding to a unitary 
mental factor or entity. It is simply a category for the 
classification and better understanding of behavior, valid in so 
far as it works, like the psychiatric categories, for instance. Gen- 
eral intelligence as a concept has on the whole worked well, and 
many of the endeavors to improve test construction turn on the 
search for better pragmatic categories. Psychiatrists presumably 
do not think of schizophrenia as a single monolithic unalterable 
mental entity, but only as a syndrome subject to all kinds of
influences. The same attitude should be taken regarding psychometric
concepts. Without such concepts the exploration of mental life
would be impossible, and if they are crude or erroneous, the
purpose should be to improve them.

(b) As to the environmental content of tests, the point is not 
to try to eliminate it, which would be impossible. Nor is it neces- 
sary to assume that the incidentally acquired mastery of such 
material is directly proportional to mental level. Mentality is not
[...]al material. It expresses [...] Stoddard (1943) puts the matter,
[...] materials so that differences in [...] just like any other
[...] changes, which is just as it should be.

Thus psychometric practice does not depend upon or imply an
hereditarian position.

fairly wide limits is probably inert in its relationship to mentality.
Or again, there is reason, as we have seen, to believe that certain 
kinds of school experience are associated with improved mental 
status. But we only know in a general way what kind, and the 
identification of the active and prepotent factors is only a matter 
of reasonable guess. So once more, the concept of chronological 
age covers a multitude of differences which probably affect men- 
tality on the principle of fifty years of Europe being worth a cycle 
of Cathay. 

Furthermore, environment cannot be evaluated in isolation, 
but only with reference to the responding individual. The same 
environment almost certainly does not have the same effect on 
persons of different mental levels and dispositions, and of dif- 
ferent ages. A bright and sensitive child will quite possibly be 
gravely depressed by an environment which an average or below- 
average child tolerates quite well. 

These are some of the questions that press upon us in view of 
what seems to be a reasonable position regarding the influence 
of heredity and environment. An outright hereditarian assumption 
is quite impossible. Neither is it possible to determine the
proportional influence of innate and environmental factors, although
this has from time to time been attempted. Thus Leahy (1935)
in her study of the psychological effect of placement in foster
homes finds by means of a statistical analysis of her data that
specific home environment accounts for 4% of changes in the
mentality of the adopted children who were her subjects. Newman,
Freeman, and Holzinger (q.v.), again, in reporting their work
on 19 pairs of identical twins raised apart, find that social and
economic influences account for 72% of the divergences which
appeared when all 19 pairs were considered, but that when the
4 most extreme cases were eliminated, the figure falls to 20%.
Numerous other investigators, including Hirsch (1930) and Burks 
(1928), have taken the responsibility for similar statements. As 
demonstrations of statistical analysis they are interesting, but it 
is doubtful whether they have much assignable general meaning. 
In any case the point is not to adjudicate a partisan competition 
between heredity and environment, but to discover what influ- 
ences are prepotent, how much change they can produce, and why. 
And this does not imply the least disparagement of psychometrics, 
but rather the contrary, for without its techniques and
instrumentalities such investigations would be impossible.


THE PSYCHOLOGICAL SIGNIFICANCE OF TEST SCORES


The best-known interpretive classification of test scores is that
of Terman, which is shown in Table 55. As will be seen, it attaches
more or less definitely meaningful characterizations to various
I.Q. levels. Many objections to it have been made by clinicians,
guidance officers, workers in applied psychology, and others who

TABLE 55
INTELLIGENCE CLASSIFIED BY I.Q. LEVELS

(Terman, 1916, p. 79)

I.Q.          Classification

Above 140     “Near” genius or genius
120-140       Very superior intelligence
110-120       Superior intelligence
90-110        Normal or average intelligence
80-90         Dullness, rarely classifiable as feeble-mindedness
70-80         Borderline deficiency, sometimes classifiable as
              dullness, often as feeble-mindedness
Below 70      Definite feeble-mindedness


true defectiveness. Thus

classification, with less positive designations. This is shown in
Table 56. It will be noted that Wechsler drops both the term 
genius and the term feeble-minded, with their far-reaching and 
uncertain connotations. Also, he has proposed still another scheme, 
which is purely statistical, based on nothing but the portions of 


TABLE 56
INTELLIGENCE CLASSIFIED ACCORDING TO I.Q.

(Wechsler, 1944, Table 4, p. 40)

Classification     I.Q. Limits      Percent Included

Defective          65 and below          2.2
Borderline         66 to 79              6.7
Dull Normal        80 to 90             16.1
Normal             91 to 110            50.0
Bright Normal      111 to 119           16.1
Superior           120 to 127            6.7
Very Superior      128 and over          2.2

the total distribution of I.Q.’s and the percentage of the total 
number of cases included in each category. He still attaches de- 
scriptive terms to these classes, but the essence of the scheme is 
categorization in statistical terms. This arrangement is shown
in Table 57.


TABLE 57
STATISTICAL BASIS OF I.Q. CLASSIFICATION

(Wechsler, 1944, Table 3, p. 40)

Classification     Limits in Terms of P.E.     Percent Included

Defective          -3 P.E. or less                  2.15
Borderline         -2 to -3 P.E.                    6.72
Dull Normal        -1 to -2 P.E.                   16.13
Normal             -1 to +1 P.E.                   50.00
Bright Normal      +1 to +2 P.E.                   16.13
Superior           +2 to +3 P.E.                    6.72
Very Superior      +3 P.E. and above                2.15
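

The percentages in Table 57 can be checked against the normal curve, since one probable error (P.E.) is about 0.6745 of a standard deviation, the distance on either side of the mean that encloses the middle 50 per cent of cases. What follows is only a minimal sketch in Python and is not part of Wechsler's own presentation; the function names and the list of class limits are illustrative assumptions.

```python
import math

# One probable error (P.E.) expressed in standard-deviation units: the
# deviation that encloses the middle 50 per cent of a normal distribution.
PE_IN_SD = 0.6744897501960817


def normal_cdf(z):
    """Proportion of a standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def percent_between(lower_pe, upper_pe):
    """Per cent of cases between two limits stated in P.E. units.

    Pass -math.inf or math.inf for an open-ended class.
    """
    lower = normal_cdf(lower_pe * PE_IN_SD)
    upper = normal_cdf(upper_pe * PE_IN_SD)
    return 100.0 * (upper - lower)


# Class limits in P.E. units, following the rows of Table 57.
bands = [
    ("Defective", -math.inf, -3),
    ("Borderline", -3, -2),
    ("Dull Normal", -2, -1),
    ("Normal", -1, 1),
    ("Bright Normal", 1, 2),
    ("Superior", 2, 3),
    ("Very Superior", 3, math.inf),
]

for name, low, high in bands:
    print(f"{name:<14}{percent_between(low, high):6.2f}")
```

Under the assumption of a strictly normal distribution the computed figures agree with the tabled ones to two decimal places, which shows that the scheme rests on the form of the curve and on nothing else.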


1. Test scores and social meanings


All this well defines the problem of attaching psychological 
significance to test scores. It is, from one point of view at least, 
to determine and define the actual capacity for effective social 
behavior indicated by a given score, or a score within a given 
range. 

A. Feeble-mindedness is a psycho-social rather than a psychometric
category. The British Mental Deficiency Act defines
an idiot as one who is unable to protect himself against common
physical dangers, an imbecile as one who is unable to communicate
by written language either through reading or writing it, and
a moron somewhat less clearly as one who needs care and supervision
for the protection of himself and others. These are degrees
of feeble-mindedness, roughly described in terms of social behavior.
A given mental test score may indicate one of these classifications,
but it is only one criterion and needs to be supplemented.


Some legal definitions of feeble-mindedness, framed with a view
to institutional commitment, specify an I.Q. of less than 75, others
an I.Q. of less than 70. Wechsler himself has pointed out that two
children may have the same I.Q. and yet require very different

It was found that among the gifted the ratio of boys to girls was
higher than that in the general population. In racial origin these 
California children were found to be mainly of Western European 
and Jewish stock. The Jewish stock contributed about twice that 
expected from the total Jewish population of the areas investi- 
gated. The average social status of the families was much higher 
than that of the average family. In general the family incomes are 
fair, and they live in superior neighbourhoods, but there are
isolated cases from very poor families living in inferior
neighbourhoods. These children come from families where there are
distinguished relatives in much greater proportion than would be
found in the average family. The vital statistics of the families 
show a healthier than average stock, with few cases of insanity or 
feeblemindedness. The anthropometric measurements show the 
gifted group physically superior. The medical examinations show 
them also superior to average children. In school progress they 
are 14 per cent of their age above the norm in grade location, 
and 48 per cent of their age above the norm in intelligence, so 
that they are under-promoted to the extent of 34 per cent. Their 
school marks are better than those of ordinary children. On 
standard educational tests the E.Q.’s of the gifted are high, but 
not as high as their I.Q.’s. The gifted are no more uneven in their
school abilities than ordinary children. Their occupational ambi- 
tions are higher than those of the control group. In general they 
have the same type of interests as ordinary children. They make 
more collections, particularly of a scientific nature. Their play 
interests are in general like those of the control group, with a 
somewhat greater interest in plays that require thinking. They are 
more mature in their play interests, showing a greater liking for 
quieter and less sociable games. These gifted children read a great 
deal more than does the average child. The average gifted child 
of 7 reads more books in two months than the average control 
child up to age 15, and the range of reading is much wider. In 
character and personality they are very superior, about 85 per 
cent of the gifted being above the median of the control group” 
(pp. 361-62). 

In the same way, but in an investigation of less scope, Lorge
and Hollingworth (q.v.) followed up a group of very superior
children who had reached the ceiling of the I.E.R. Intelligence Scale
CAVD during secondary school, and of whom some had I.Q.’s in
excess of 170. At college age they were found to have carried on
research in history, mathematics, and chess, and two of them were
established in the learned professions.

Thus there can be no doubt that a high intelligence score is an 
indicator of general high level social behavior. But as Stoddard 
(1943) as well as many others have pointed out, the statement 
that it indicates genius is most misleading. The performance of 
the superior group studied by Lorge and Hollingworth is no doubt 
highly creditable and unusual, but it has none of the distinctive 
marks of supreme creative achievement. And an I.Q. of 140, far
from indicating genius, is actually reached by a not inconsiderable
proportion of American college students.

tion tests measured in seconds for 766 boys 14 years old had a
mean range of 3.87:1. This would seem to dispose of the claims 
sometimes found that human beings are likely to show such ex- 
treme differences in respect of this or that trait that some will be 
ten, twenty, or more times as able as others. As Wechsler points 
out, the human race is much more homogeneous than some other 
organic types, such as dogs or trees. And one consequence is that 
comparatively small differences in test performance may point to 
enormous differences in prestige and social and general success. 


2. Quantity and quality 

Another way of looking at the problem of the psychological 
significance of test scores is to think of it as the assignment of 
qualitative meanings to quantitative indices. 

Thus a mental age score of 10 attained by a person whose 
chronological age is fifteen indicates a different kind of mind from 
that of a person whose M.A. and C.A. are both 10. The former 
will probably excel the latter at muscular, motor, and routine 
performances including memory, whereas the latter will excel the 
former in verbal discrimination and linguistic and numerical
performances involving high-level organization (Merrill, 1924; E. B.
Greene). Terman (1906) found just such differences between 
seven “bright” and seven “stupid” boys. 

Also, mental test performance reflects personal and tempera- 
mental type. We have seen that there are characteristic differences 
in Stanford-Binet and Wechsler-Bellevue profiles for different 
psychotic categories. Wells and Kelley (q.v.) found the performance
of psychotics on vocabulary and digit memory stood up well,
but that there was marked deterioration in drawing designs from 
memory, paragraph interpretation, and the Ball and Field test 
(Stanford-Binet year 12). In all probability less marked tempera- 
mental deviations also affect test performance qualitatively. 

Top level differences are also significant. Hollingworth and 
Cobb (q.v.) studied 20 pairs of children matched on home
conditions, whose mean I.Q.’s were 165 and 146. The brighter were
markedly better in more complex tasks, such as word interpreta- 
tion, language use, and mathematical thinking. In routine tasks 
there was little difference. The two groups entered school at mean 
M.A.'s of 11-11 and 13-4. By the time the duller group had 
reached a mean M.A. of 13-4, their school performance was not 
equal to that of the brighter.

All aspects of personality are tied in with a mental test score
and what it measures or indicates. Thus it has long been known 
that there is an association between low mentality and
delinquency (v. Pintner, 1937, for evidence up to that time, and Lane
and Witty for a summary of later findings). So Ackerson (q.v.)
finds a much higher proportion of behavior problems among chil- 
dren of low I.Q. among a population of 5,000 of which he made 
a study. Yet we certainly cannot say that low intelligence is a 
direct cause of delinquency. The truth is that less intelligent 
individuals tend to be easily led, passive, timid, easy victims for
vicious leadership and suggestion, often fixed in an environment 
where opportunity and stimulation are lacking, and so unable to 
use what capacities they have (Doll, 1934, 1940). How much 
social setting and tradition have to do with delinquency is shown 
by the fact that it is found much more frequently in boys than 
in girls, the proportion running from 5 to 3 to 7 to 3. Nevertheless
the typical delinquent is at most of dull normal mentality, and
an I.Q. of 80 is an important diagnostic sign, indicating the sort
of problems and difficulties the person is likely to encounter in 
his living. 


So, conversely, is a very high intelligence score. Reference has
already been made to the study by Lamson (1930) in which she

quotients were: Byron, 150; Scott, 150; Darwin, 135; Goethe,
Mill, Macaulay, Pascal, and Leibnitz, all above 180. But Cox also 
points out that for the greatest achievement favorable character 
traits are likewise of the highest importance. Galton's famous 
threefold formula for achievement still holds. It consists of great
ability, power to work, and willingness to work. Hollingworth and
Terman (q.v.) have remarked that not one person in four or five
of those of I.Q. 180 and over have all three characteristics. 


3. Conclusion 


The clear general conclusion is that psychometric scores, be- 
cause they are quantitative and obtained under uniform condi- 
tions, are extremely valuable indices. But they always need to be 
interpreted in the light of all the facts, social and personal. They 
can greatly aid us to understand and guide human beings as total 
personalities in total life settings. This in itself is evidence that 
the working concepts around which tests are built are reasonably 
well founded and provide authentic guide lines for the under- 
standing of human mentality, and that the tests themselves are 
reasonably valid. The contention of Thomas and others that psy- 
chometrics is essentially atomistic or mechanistic, that of neces- 
sity it deals only with parts and never with the organic whole, 
is nonsense. Of course, psychometrics is analytic, but no phe- 
nomenon can be explained, or understood, or managed, without 
analysis. The only questions are whether the analysis is tolerably 
sound, and whether its results are put to wise and constructive
uses.

point of their use with adults apply also to other types of tests, e.g.,
some of the personality tests previously discussed?

9. Check over in detail and with care the statistical procedures 
used by Odom and by Thurstone in trying to establish equal units 
and a true zero point. What logical or statistical assumptions do they 
seem to make? How does this affect the interpretation of their results? 

10. If broad statements as to the proportional effect of heredity 
and environment cannot be validly made, is the whole issue without 
practical importance? If this is not true, what importance has it? 

11. Suggest some of the matters concerning vocational, educational,
social, and business relationships and activities, and in fact concern- 
ing the whole range of a person’s living that might be indicated by 
either a high or a low intelligence test score. How certainly would 
they be indicated by such a score? If not with complete certainty, 
would the score still suggest possible interpretations, cautions, etc.? 
What other factors should be considered also? 


CHAPTER XI 


THE EVOLUTION AND IMPROVEMENT OF 
MENTAL TESTING 


PLAN OF THE CHAPTER 


The time has now come to weave together the various lines of 
discussion presented in this book into

diced can deny. But the main trend of evolution is towards the
making of tests that will yield surer and more significant indices 
of the quality of the individual's mental life, and the probable 
effectiveness of his social behavior. That this is a subject of the 
highest theoretical interest to the specialist in mental testing 
clearly needs no demonstration. But it has a practical importance 
also, for only by an intelligent appreciation of what has been 
accomplished and what is being attempted can any person use and 
interpret existing tests judiciously, or avail himself of those that 
are emerging. 

This quest for greater psychological significance or more authen- 
tic validity will be considered under four aspects—the search for 
the most stable and meaningful unit of measurement, the search 
for the most significant interpretive standardization, the search 
for new and better operating concepts, and the tapping of new 
psychological resources, this last having to do with the emergence
of projective rather than psychometric tests. All these four aspects 
are interrelated, and their separate treatment, like all abstraction 
and analysis, is only for the necessary purpose of clarification. 


THE STABILITY AND MEANINGFULNESS OF UNITS OF 
MEASUREMENT 


For any unit to be stable, it must fulfill two conditions. First, 
there must be a fixed point of reference, an origin, whose meaning 
is unambiguously defined. When temperature is being measured, 
the point of origin is established by the freezing point of water 
under stated atmospheric conditions. On the centigrade scale this 
is called zero, and on the Fahrenheit scale thirty-two degrees. Here 
as always, the name or symbol used does not matter so long as 
its reference is unmistakable. When weight is being measured, the 
point of origin is established when an equalized balance scale is 
horizontal. When distance is being measured, the feet, or centi- 
meters, or other units are laid off from a zero point from which 
the measurement starts. No scale of units which does not yield 
such a fixed origin can be used to obtain stable results.

The second condition is that each unit must be equal to every 
other unit, or vary from it according to a known law. The lati- 
tude units on a globe illustrate the former alternative, and those 
on a Mercator projection map the second. Inequality of units does 
not matter so long as the law of their variation is known. One
might roughly measure the speed of a car by recording the amount
of gasoline consumed per equal unit of time. The relationship 
would be a changing one because it takes much more power to 
accelerate from sixty to seventy miles an hour than from ten to 
twenty. But although it changes, the law of its change is known 
and can be allowed for in our computations. 
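
The Mercator case can be made concrete. On that projection equal steps of latitude occupy unequal lengths on the map, but the lengths follow a known formula, so that any map position can be converted back to true latitude. The sketch below, in Python, assumes a sphere of unit radius; the function names are illustrative only.

```python
import math


def mercator_y(lat_deg):
    """Map ordinate of a latitude on a Mercator projection (unit sphere)."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))


def latitude_from_y(y):
    """Invert the projection: recover latitude in degrees from the ordinate."""
    return math.degrees(2 * math.atan(math.exp(y)) - math.pi / 2)


# Equal 10-degree steps of latitude take up unequal lengths on the map ...
low_step = mercator_y(10) - mercator_y(0)
high_step = mercator_y(70) - mercator_y(60)
print(low_step, high_step)

# ... but because the law of variation is known, nothing is lost.
print(latitude_from_y(mercator_y(60)))
```

The step from 60 to 70 degrees is more than twice as long on the map as the step from 0 to 10 degrees, yet the inverse function recovers the original latitude. It is the absence of any such known law that makes unequal mental units troublesome.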

These same two conditions must be fulfilled by units and scales 
of mental measurement if they are to possess the characteristic of 
stability. 

As to meaningfulness, a unit of measurement is meaningful in 
so far as it enables us to predict, or understand, or deal with a range
of actual phenomena. Degrees of temperature, inches, or pounds
are meaningful in this sense. They fulfill the logical and statistical
requirements for stability, or they would be useless. But they are
much more than abstract statistical entities, because they enable
us to deal with phenomena and experience by giving answers to
the question: How much?

Units of psychological measurement, too, must be meaningful

of an age group, although logically it could also be defined as the
mean age of a group achieving a given test performance or raw 
score. Thus the M.A. fulfills the first condition for stability, that 
of having an unambiguous origin of reference. The second con- 
dition, however, that of equality, it does not fulfill very satisfac- 
torily, and this is one of its chief limitations. The real differences 
in mental ability indicated by equal M.A. differences almost cer- 
tainly vary and become less as age advances. Thus there is more 
change in mentality between M.A.'s 4 and 5 than between 11 and 
12. But since the form of the mental growth curve is not known 
and perhaps never can be known, the law of these changes cannot 
be determined, as it can very easily on a Mercator map of the 
world, and only very rough allowance can be made for them. 
Also, the M.A. is not applicable above the “age of arrest,” which 
differs for different tests, and means the chronological age at which 
test scores do not increase regularly as C.A. increases. Thus the 
M.A. is by no means a perfectly stable unit of measurement, but 
it has enough stability to be used to good purpose with proper 
precautions. 

As to meaningfulness, the following points are important. (a) 
The M.A. is a measure of the level of mental development or 
maturity, and not of absolute brightness. Thus the proper time 
for first grade entry is often set at M.A. 6, without direct reference 
to the I.Q., which has a different significance. (b) It depends for 
its significance on the type of test used. M.A.'s on verbal tests 
and performance tests correlate positively but only moderately. 
Thus, in using the M.A. for the evaluation or guidance of any 
person, the test on which it is computed should always be con- 
sidered. (c) A given M.A. has different qualitative and behavior- 
istic meanings at different chronological age levels. This we have 
already seen. If two children C.A. 6 and 12 both have an M.A. 6, 
it indicates quite different patterns of social behavior and types 
of mind. To put it more obviously still, no one would think a 
person C.A. 90 and M.A. 6 was fit for first grade entry. These, of
course, are limitations, but again they are not fatal. In using and 
interpreting the M.A. the following cautions should always be 
observed. (a) It does not apply at all to ages above about 16, and 
not very well to very young children. (b) The real value of M.A. 
differences diminishes as age advances, just how much we do not 
know. (c) The test from which the M.A. is derived should always 
be considered. (d) The C.A. of the person concerned should
always be considered, and indeed his whole personal and social
setting. 

B. The Intelligence Quotient. The intelligence quotient, in the 
only legitimate sense of the term, is the ratio of mental age to 
chronological age. The problem of its stability has given rise to
numerous misunderstandings and should be stated and understood 
as clearly as possible. 
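
As a worked illustration of the ratio, the sketch below computes the quotient from ages expressed in months. The multiplication by 100, which merely removes the decimal point, is the usual convention and is assumed here; the function name is hypothetical.

```python
def intelligence_quotient(mental_age_months, chronological_age_months):
    """Ratio I.Q.: mental age divided by chronological age, times 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# The same M.A. of 10-0 at two different chronological ages.
print(round(intelligence_quotient(120, 180)))  # C.A. 15-0 -> 67
print(round(intelligence_quotient(120, 120)))  # C.A. 10-0 -> 100
```

The same mental age thus yields very different quotients according to the chronological age at which it is attained, which is the whole point of the ratio.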

With the intelligence quotient, as with any other unit of
measurement, a fixed and unambiguous reference point or origin is

that the I.Q. or other such measure loses all significant stability.
This is so with the Merrill-Palmer scale, as we have seen. It is 
strikingly true of the Pintner Non-Language Test, as is shown by 
the data reported by Rand shown in Table 58. Test performance 
one standard deviation above the mean (100) would give a C.I. of 
167 at C.A. 7, and 115 at C.A. 14. Thus there are certain tests 
where the use of the intelligence quotient is inadmissible. (b) If all 
I.Q.’s other than 100 are to have a stable meaning, it is neces- 
sary not only that the range or amount of the age level distributions 
be approximately equal, but also that their form be the same. If 
at one age level there were a distribution shaped like that shown 
in Figure 29A (i.e., normal), and at another age level, like Figure 
29 (skewed), then even if the ranges and the S.D.'s were the 
same, they would not be equal units. This requirement of normal 
distribution at all ages seems to be approximately fulfilled by 
Stanford-Binet I.Q.’s up to C.A. 16, beyond which it becomes a 
pure assumption. As to other tests, they vary considerably in this 
respect. So the regular appearance of normal distributions should 
never be taken for granted, and the direct evidence presented by 
the authors should always be noted and checked. (For a fuller 
discussion of this subject, though in terms of somewhat outmoded 
data, see Rand). 

TABLE 58 


STANDARD DEVIATIONS OF COEFFICIENTS OF INTELLIGENCE ON 
PINTNER NON-LANGUAGE TEST FOR VARIOUS AGES 


(From Rand, Table 1, p. 605) 


Age Levels ............ 7   8   9   ...   12   ...   14 

Standard Deviations ... 67   59   37   32   32   33   15 


As to the psychological significance of the intelligence quotient, 
it turns upon a very simple, straightforward, common-sense idea, 
almost too familiar to most parents. This is that the age when a 
mental function or performance appears is highly indicative. To 
learn algebra at 15 is not remarkable, but to learn it at 6 indicates 
endowment of the highest and rarest order. This piece of common 
sense is embodied in statistical form in the I.Q. Whereas the M.A. 
measures mentality in terms of its developmental level, the I.Q. 
measures it in terms of brightness. 
It is sometimes said that the I.Q. indicates the speed of mental 
growth. But this is much more doubtful, since the very idea of 
a growth curve and so of an over-all rate of growth is open to 
question. There has been considerable discussion as to the growth 
patterns implied in the use of the I.Q. Freeman (1939, 1921) has 
pointed out two alternatives. Suppose that three children, A, B, 
and C, respectively, are bright, average, and dull. Their mental 
growth may start at the same point of origin and proceed at dif- 
ferent speeds, so that the spread between them increases as they 
grow older. Or it may start at different origins and run parallel, 
so that the spread between them remains constant. The conditions 
[...] score to the mean point score for his age. Thus, if the mean point 
score for C.A. 8 is 100, and the individual's point score is 120, 
his C.I. is 120. This measure has virtually the same logical charac- 
teristics and general meaning as the I.Q. The point of reference is 
the mean point score for the age group, which is always 100. To 
be a stable unit, it must show even distributions for different ages, 
and these must be normal. Much less study has been made of the 
C.I. than of the I.Q., and it cannot be compared certainly with 
the I.Q. on the basis of demonstrated constancy because of lack of 
data (v. Rand). Here again the essential idea is to have a measure 
that will give the same designation to the same degree of bright- 
ness at all ages. 

D. The Index of Brightness. Unlike the two previous measures, 
the index of brightness, or I.B., turns upon the use of a difference 
rather than a quotient. It is the difference of the individual point 
score and the mean point score for age added to or subtracted 
from 100. If the mean is 110, and two individuals score 90 and 130, 
the I.B.’s are 80 and 120. The measure has not been much used. 
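
The arithmetic of the coefficient of intelligence and the index of brightness can be checked with a short sketch. It is an illustration only, assuming that the C.I. is the ratio of point score to age mean multiplied by 100, as the worked figures in the text imply. The function names are hypothetical.

```python
def coefficient_of_intelligence(point_score: float, age_mean: float) -> float:
    """C.I.: the individual's point score as a percentage of the age mean."""
    return 100.0 * point_score / age_mean


def index_of_brightness(point_score: float, age_mean: float) -> float:
    """I.B.: 100 plus the signed difference between point score and age mean."""
    return 100.0 + (point_score - age_mean)


# Figures from the text: age mean 110, individual point scores 90 and 130.
for score in (90, 130):
    ib = index_of_brightness(score, 110)
    ci = round(coefficient_of_intelligence(score, 110), 1)
    print(score, ib, ci)
```

With a mean of 110 the two measures already part company slightly (an I.B. of 80 against a C.I. of about 81.8), since one rests on a difference and the other on a quotient.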

E. The Percent of Average. This is Kuhlmann's variant on the 
Heinis (q.v.) personal constant. It means the percent of average 
development as shown on the alleged normal growth curve attained 
by any individual at a given age. Three advantages are claimed 
for it. (a) It is based on an actual curve of normal growth, unlike 
the M.A. and the I.Q. (b) It has a greater statistical stability 
than the I.Q. That is, its range of distribution at different ages is 
more uniform. A comparison of Tables 10 and 11 will show the 
prima facie evidence for this claim. (c) There is a tendency, noted 
by P. Cattell (1937) and others, for high I.Q.’s to rise and low 
ones to fall. This is said not to be true of the P.A. (v. Hilden). 
To this the following replies have been made. (a) The alleged 
growth curve is mythological. Its derivation by Heinis (q.v.) is 
unsound, and there is doubt whether any general growth curve is 
possible. (b) The appearance of greater statistical stability is 
illusory. When the fluctuations are considered, not absolutely but 
in relation to the magnitude of the scores themselves, the advan- 
tage of the P.A. over the I.Q. vanishes (McNemar, 1942). (c) 
There is evidence that low P.A.’s tend to drop and it is probable 
that high ones tend to rise, similarly to the I.Q. (P. Cattell, 1933). 

Kuhlmann (1939) has offered a rebuttal to the criticism that 
since the form of the growth curve is unknown, the P.A. is illusory. 
Its essential purpose, he says, is to give the same designation to 
the same degree of brightness at all ages. Thus it would still be 
valuable even if the very notion of a general growth curve were 
admitted to be untenable. 

E. Purely Statistical Measures. There is a tendency to prefer 
to all the above, purely statistical measures without psychologi- 
cally suggestive names. These are based on some unit of dispersion. 

(a) The raw scores may be arranged in percentile ranks, as with 
the Otis scores shown in Table 15. This tells us what percentage 
of all scores is exceeded by any given score. Thus in Table 15 
the raw score of 70 is higher than 84% of all scores attained. The 
limitation of this method is its distortion of the data. In the table 
a raw score difference of 1 between 62 and 61 is a percentile 
difference of 3 points. But the raw score difference of 8 points 
between 33 and 25 is a percentile difference of less than 1 point. 
An illustration will help to show why this happens. Let us sup- 
pose that 20 players are entered in a tennis match, that the middle 
10 are very close to each other in ability, and the top 5 and the 
bottom 5 quite widely spaced. If the players are ranked in centiles, 
a single percentile difference in the middle of the distribution will 
stand for a much smaller difference in real ability than at the top 
or bottom ends. Also a very small shift in performance among 
the middle 10 will yield a percentile shift, while it would take a 
considerable shift in performance at either end to change the 
percentile ranking. Thus, if we want a measure that will give the 
same designation to the same amount of brightness at all levels 
and all ages, the percentile is not likely to do so, because it 
is affected differently at different levels by changes of bright- 
ness. 
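
The tennis illustration can be put in numbers. The sketch below is an added example with invented ability values (the bottom five and top five players widely spaced, the middle ten closely bunched). It takes the percentile rank, as in the text, to be the percentage of all scores exceeded by a given score. The function name and the data are hypothetical.

```python
def percentile_rank(score: float, scores: list) -> float:
    """Percentage of all scores that fall below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)


# Twenty hypothetical players: wide spacing at the ends, close in the middle.
bottom = [10, 20, 30, 40, 50]
middle = [60 + i for i in range(10)]  # 60, 61, ..., 69
top = [80, 90, 100, 110, 120]
abilities = bottom + middle + top

# One unit of ability in the middle of the distribution ...
middle_shift = percentile_rank(61, abilities) - percentile_rank(60, abilities)
# ... against ten units of ability near the top.
top_shift = percentile_rank(90, abilities) - percentile_rank(80, abilities)

print(middle_shift, top_shift)  # 5.0 5.0
```

A difference of one unit among the middle players and a difference of ten units among the top players both appear as five percentile points, which is the distortion described above.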

(b) To avoid this, measures based on the standard deviation 
are more commonly employed and generally recommended. An 
instance may be seen in Table 25. Here the deviations of the raw 
scores from the mean expressed in S.D. units are converted into 
standard scores by multiplying them by 10 and adding 50 to avoid 
plus and minus signs. Wechsler’s method of deriving his misnamed 
I.Q. is a variant of standard deviation scoring. He uses the prob- 
able error, which is .6745 times the standard deviation, and de- 
[...] such and such a test performance is 1 S.D. above the mean at 
age 6, and another test performance is 1 S.D. above the mean at 
age 12, then both are of equal difficulty for their age groups. The 
great stability so obtained may be seen from Table 12, where the 
S.D.’s of the Wechsler so-called I.Q.'s are given. In point of sta- 
bility, i.e., uniformity of distribution, they are clearly superior to 
both Stanford-Binet I.Q.’s and P.A.’s. 
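
The conversion described above, in which deviations in S.D. units are multiplied by 10 and added to 50, may be sketched as follows. This is an added illustration, not Wechsler's own procedure. The function name is hypothetical, and the standard deviations of 67 and 15 are the values for C.A. 7 and C.A. 14 implied by the earlier discussion of the Pintner Non-Language Test.

```python
def standard_score(raw: float, mean: float, sd: float,
                   scale_mean: float = 50.0, scale_sd: float = 10.0) -> float:
    """Express a raw score as a deviation from the mean in S.D. units,
    then rescale it (by default: multiply by 10 and add 50)."""
    z = (raw - mean) / sd
    return scale_mean + scale_sd * z


# A performance one S.D. above the age mean of 100 is a C.I. of 167 at
# C.A. 7 (S.D. 67) and a C.I. of 115 at C.A. 14 (S.D. 15).
print(standard_score(167, 100, 67))  # 60.0
print(standard_score(115, 100, 15))  # 60.0
```

The two quotients differ by more than fifty points, while the standard score gives both performances the same designation.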

McNemar (1942) and Terman and Merrill (1937 b) have never- 
theless defended the I.Q. as preferable to the standard score. 
For this they give two reasons. First, teachers and parents and 
others understand the I.Q. and the M.A., but do not well under- 
stand the standard score and cannot interpret it. This may be 
true. But it has been urged in reply that persons who are con- 
cerned with tests and test results should be educated to under- 
stand the best available measures, and that so-called lay under- 
standing of the I.Q. consists of a good deal of misunderstanding. 
The second point made by McNemar and Terman and Merrill is 
that the I.Q. is itself in effect a standard deviation score, since 
it, too, indicates the divergence of an individual test performance 
from the age norm in terms of the age distribution. However, this 
is only approximately true, as we have seen; and there seems no 
good reason for preferring a measure somewhat inaccurately ex- 
pressing deviation from the mean and weighted with a name full 
of implications to a measure which is entirely accurate in this 
respect and without any psychological suggestions or overtones 
whatsoever. 

For the practical effect of all these purely statistical measures 
is to throw more of the burden of interpretation upon the person 
who uses the test and its results. This is just as it should be. A 
good score should be a stable measure with clearly defined charac- 
teristics, i.e., as indicating brightness or maturity. It should show 
as unambiguously as possible the standing of any individual rela- 
tive to the ability defined and embodied in the test. It should 
avoid forcing interpretations which might be valid for the stand- 
ardization group but not for other groups or individuals. It should, 
in other words, tell as directly as may be a factual story. It need 
not, for this reason, lack psychological content and significance. 
If we know that a person is 2.5 S.D. above the mean on a good 
intelligence test, this is a more accurate piece of information than 
knowing that his I.Q. is 135, because this S.D. deviation is directly 
and without question comparable to all others of the same amount 
at other age levels, whereas an I.Q. of 135 at 8 is only approxi- 
mately comparable to an I.Q. of 135 at 12. Everything contained 
or implied in the I.Q. score is contained and implied in the S.D. 
score, and it is more precisely expressed. 

The only reservation to make in connection with standard 
scoring is that unless the distributions are approximately normal 
it becomes seriously misleading. The magnitude or range of the 
distribution, so long as it remains the same, does not affect the 
situation, but its form does. In the case of the best-constructed 
tests there is fairly good reason to believe that the distributions 
are approximately normal, although conclusive proof can never 
be had. The caution, however, is well worth observing, since there 
are plenty of tests on the market which are neither well con- 
structed nor competently analyzed, and whose authors seem to 
think that a normal distribution is to be assumed as a consequence 
of natural law instead of something for which evidence should be 
forthcoming. 

F. The Accomplishment Quotient. The accomplishment quo- 
tient is the ratio of educational age to mental age, or of edu- 
cational quotient to intelligence quotient, the idea being first 
formulated by Franzen (q.v.). It is intended to measure effort or 
motivation. If educational age is less than mental age, the A.Q. 
is less than 100. If it is equal to mental age, the A.Q. is 100. If it 
is greater, the A.Q. is above 100 (v. Stebbins and Pechstein; 
Cureton). It must not, however, be thought of as a ratio between 
pure intelligence on the one hand, and pure achievement on the 
other, for intelligence tests and achievement tests have much in 
common. Rather, it is a ratio between a less and a more general 
measure. It has the following disadvantages. (a) It tends to penal- 
ize the bright pupil, who cannot work “up to capacity” as readily 
as the dull pupil. (b) Since it is derived from two tests, one men- 
tal and the other educational, it has a lower reliability than either 
(v. Chapman). (c) It will not be a valid measure unless the 
distributions of the two tests are approximately equal, for only 
then will a given E.Q. (say, 120) be comparable with the nomi- 
nally corresponding I.Q. of 120. Unless the two distributions are 
the same, the two seemingly identical figures will have different 
real values (v. Freeman, 1939). As a formal instrumentality it 
cannot be recommended. But the same idea can be applied in the 
ability standard technique (v. Symonds, 1927). Mean scores on 
educational tests can be computed for various intelligence level 
groups, and these can be used as standards for pupils whose intelli- 
gence ratings are known. 
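
The two equivalent definitions of the accomplishment quotient can be verified with a small computation. The sketch is an added illustration. The pupil's ages are invented, the function name is hypothetical, and the educational quotient is assumed to be formed like the I.Q., as 100 times educational age over chronological age.

```python
def quotient(numerator: float, denominator: float) -> float:
    """Generic age quotient: 100 * numerator / denominator."""
    return 100.0 * numerator / denominator


# Hypothetical pupil, all ages in months.
chronological_age = 96
mental_age = 120
educational_age = 108

iq = quotient(mental_age, chronological_age)       # intelligence quotient
eq = quotient(educational_age, chronological_age)  # educational quotient

aq_from_ages = quotient(educational_age, mental_age)
aq_from_quotients = quotient(eq, iq)

print(iq, eq, aq_from_ages, aq_from_quotients)
```

Here a pupil whose educational standing is well above the average for his age (E.Q. 112.5) still receives an A.Q. of 90 because his I.Q. is 125, which illustrates the first disadvantage noted above.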


2. Profile Scoring 


As we have seen, there is today an increasing criticism of global 
scores on the ground that they conceal relevant differences under 
average ratings, and thus obscure diagnosis and analysis. The 
alternative is the profile rating, which presents a pattern of scores 
on separate traits or functions, with or without an average or 
over-all global rating as well. Seashore, for example, definitely 
objects to the averaging of scores on the six subtests of his bat- 
tery, and claims that results should always be reported as a pro- 
file of pitch discrimination, time discrimination, loudness dis- 
crimination, timbre discrimination, rhythm recognition, and tonal 
memory. The separate scores on which the profile is constructed 
are usually computed on the basis of percentile ranks or standard 
deviations. In the above case the percentile method is used. The 
comments already made on these methods apply here and will not 
be repeated. 

In all profile rating the key question is this: How are the sepa- 
rate traits or abilities which the profile represents conceived and 
defined? If they are ill-conceived, ill-defined, and so not authentic, 
the profile means nothing at all. It is indeed worse than useless, 
for it becomes highly misleading. A few illustrative references 
will make the issue clear. 

The six traits embodied in the six subtests of the Seashore bat- 
tery are without doubt defined both in words and in terms of test 
items with great precision. One may fairly ask whether they really 
are components of musical talent and whether the subtests are 
reliable enough to yield a stable profile when the differences are 
small. But with these reservations, the obtained profile has an 
exact and indubitable meaning. In the same way the Minnesota 
Multiphasic Personality Inventory is capable of establishing sig- 
nificant and important diagnostic differentiations for most, though 
perhaps not for all, of its different scales, because of their well- 
defined meaning, based on psychiatric theory and practice, and 
their careful embodiment in test items. The same cannot be said 
of the Bernreuter Personality Inventory, because its six classifica- 
tions—introversion-extroversion, neurotic tendency, dominance- 
submission, self-sufficiency, self-confidence, and sociability—are 
unstable, indefinite, and dubious. A three-class profile based on the 
Detroit General Aptitudes Examination, to show differentials in 
general intelligence, clerical aptitude, and mechanical aptitude, 
would almost certainly have little meaning, because the functions 
are not well clarified and defined either verbally or in the test 
items. 

Among intelligence tests the Wechsler-Bellevue scale estab- 
lishes an effective and meaningful two-class profile between per- 
formance-test intelligence and verbal-test intelligence, and profiles 
based on its 11 separate subtests are found by Rapaport to pos- 
sess considerable diagnostic efficiency. The latest editions of the 
American Council on Education Psychological Examination pur- 
port to establish two-class profiles on quantitative and verbal 
ability, though their validity for guidance has not been estab- 
lished so far as the present writer knows. The Chicago Test of 
Primary Mental Abilities represents the most ambitious effort in 
this direction in the field of intelligence testing, establishing pro- 
files to differentiate perceptual ability, numerical ability, verbal 
ability, space-visualizing ability, memory, induction, and deduc- 
tion. Considering the enormous and intricate research devoted to 
separating out and defining these alleged separate abilities, the 
result must be treated with great respect. But once again the 
psychological significance and authenticity of the profiles has not 
yet been sufficiently established for any final judgment. With the 
California Tests of Mental Maturity the case is more doubtful, 
for its categories of memory, spatial relationships, reasoning, and 
memory are open to very grave question. As Kuhlmann says, the 
proposition of showing that a child has a “good memory” is one to 
cause very considerable hesitation. As to the Detroit Tests of 
Learning Aptitude there can, unfortunately, be very little doubt 
at all, for any profile ratings yielded will be upon vaguely defined 
and nebulous faculties, to the existence and significance of which 
there is simply no direct witness whatsoever. 

So, to sum it up, it is easy enough to produce some kind of profile 
based on differential test performance, but to produce a good and 
meaningful one is quite another matter. The danger obviously 
is that we shall have a proliferation of alleged profiles based 
on sheer conjecture without any underlying framework of well- 
analyzed, well-defined, significant psychological concepts. Even 
global scores will be better than that! 

STANDARDIZATION 


As shown repeatedly in these pages, the very essence of psy- 
chometric procedure turns upon the interpretation of a given 
individual's test performance in terms of the performance of a 
population or universe of discourse. But since the entire popula- 
tion can very rarely be tested, recourse must be had to a sample, 
known as a standardization group. Mental tests never measure 
any psychological function or ability directly, but always by set- 
ting up a behavior pattern believed to embody it, and instituting 
a comparison between that behavior as manifested by the indi- 
vidual subject and by the population, through the mediation of a 
standardization group. This fact, that mental measurement is 
always relative or comparative, is nothing against it. A great many 
such comparative judgments are constantly made in everyday life, 
and indeed in science. If the average length of the noses of two 
racial groups differed by one centimeter, it would be a very strik- 
ing phenomenon. But if there were one centimeter of mean dif- 
ference in their standing height, it would be far less noticeable 
and presumably less important. One centimeter, or one milligram, 
or one degree of temperature centigrade have very different mean- 
ings in different situations; and judgments, both common-sense 
and scientific, always depend upon the interpretation of obtained 
differences comparatively or relatively to their background. There 
is, therefore, no objection on a logical basis to this aspect of psy- 
chometric procedure. 

But experience and research have brought to the foreground 
many problems in the actual application of the logic of compara- 
tive judgment in the field of mental measurement. That the stand- 
ardization group must be a representative and adequate sample 
goes without saying. For this the primary consideration is not its 
size but its selection. And the selection of a true and unbiased 
sample of human beings with respect to some trait is exceedingly 
difficult. Indeed, in virtually all test construction, even the very 
best, there are flaws in the selection of the standardization group. 
But when the work is done by the best and most careful and 
thorough methods, the approximation is good enough for satis- 
factory practice. The improvement of sampling techniques is prob- 
ably not a major point in the evolution of better tests. At any rate, 
it is not a major point in current psychometric research. 

A much more immediately cogent question is: Against just what 
population is a test being projected and its interpretive norms 
developed? One might say that if a test is to measure intelligence, 
or musical talent, or power of inductive reasoning, or schizoid 
tendency, it should surely contemplate these functions in the 
human race at large. They are supposed to be general or universal 
psychological functions, which means that they may be mani- 
fested by all men in some degree. This, in a sense, no doubt, is true. 
Surely, then, it is necessary to compare or rank or rate the indi- 
vidual against this universal human manifestation. And if so, the 
only adequate standardization group will be a true sample of the 
human race. 

There have indeed been dreams of such universal tests, as 
Schieffelin and Schwesinger point out (q.v.). But in the main they 
are only dreams. The only functions it is possible to measure on 
any such inclusive scale are the simple sensori-motor processes, 
such as reaction time or intensity discrimination. And even then 
there are the gravest difficulties. Every existing mental test is 
projected against a more or less explicitly selected population, 
and its standardization group is chosen accordingly. Thus the 
Revised Stanford-Binet scale was standardized in terms of a popu- 
lation consisting of native American whites. And there is the 
same limitation in all other tests. When applied to Negroes, or 
to Indians, or to English-speaking Mexicans, or to some very 
special socioeconomic group, the test goes beyond its contemplated 
application, strictly speaking, and there is always the question of 
whether its norms may not give false interpretations. 

One suggested way out has been to develop special group norms 
based on racial or socioeconomic classifications or school grades 
(v. Freeman, 1939). This, however, is exceedingly difficult to 
do, and it has not been done on a large scale, although there have 
been some experimental attempts, such as dual standardization 
on city and country children. Also, there is the question of whether 
we want to compare an individual with his own particular limited 
background or against a wider setting. Shall we compare Negroes 
with Negroes, rural children with rural children, and so on? Or 
shall we take some wider criterion group, including rural children, 
city children, children of professional, and business, and skilled, 
and unskilled laboring parents in about the proportion of these 
classes in the general population? Which will yield the more 
meaningful interpretations? 

This final question itself implies the answer. It depends upon 
the purpose for which the test is intended. On the one hand, there 
has been a development towards the special purpose mental test. 
Army Group Intelligence Examination Alpha itself, in its original 
form, was in effect such a test although not a highly evolved 
example of the type. The University of Iowa Placement Tests are 
a better example. One cannot say that they measure simply aca- 
demic aptitude, still less “University of Iowa aptitude.” What 
they do measure is intelligence as it functions in the curricular 
setting of the University of Iowa. Therefore, it is proper and 
desirable to standardize it on a sample of this intelligence. So, if 
we think of the census group instead of the racial group, and want 
to measure the inductive reasoning ability of children of Welsh 
descent in Wilkes-Barre, there ought to be special interpretive 
norms for this purpose. It may be perfectly proper to use a test 
more broadly standardized if a broader comparison is desired, but 
the true meaning of the scores must never for a moment be for- 
gotten. Thus one tendency is towards standardization in terms of 
the special functional group for which the test is intended. And a 
functional group, be it noted, is not a mere classification, such as 
that of all persons earning between $5,000 and $10,000 a year. It 
is a group brought together by some community of purpose which 
makes it significant, such as candidacy for medical school, or for 
a given college, or perhaps life on a canal boat. This practice of 
standardizing tests on defined functional populations is a distinct 
and growing development. 
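
How the choice of standardization group alters the meaning of a score can be shown with a small computation. The sketch below is an added illustration with invented raw scores for a broad sample and for a selected functional group. The function name and both sets of norms are hypothetical and are not drawn from any actual test.

```python
from statistics import mean, pstdev


def deviation_in_sd_units(raw: float, norm_scores: list) -> float:
    """Deviation of a raw score from the mean of a norm group, in S.D. units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


# Invented raw scores for two standardization groups.
broad_sample = [30, 40, 50, 60, 70]      # e.g., the general population
functional_group = [60, 65, 70, 75, 80]  # e.g., candidates for a given college

raw_score = 70
print(round(deviation_in_sd_units(raw_score, broad_sample), 2))
print(round(deviation_in_sd_units(raw_score, functional_group), 2))
```

The same raw score stands well above the mean of the broad sample and exactly at the mean of the functional group, so the norms used must always accompany the score.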

On the other hand, interest may center upon the mental process 
itself, and then the more inclusive its representation the better. 
To standardize a test of primary mental abilities on a limited 
functional group would certainly be questionable. Limits there 
will undoubtedly be. One would presumably not include Hotten- 
tots or Eskimos in the sample, unless experimentally. But the 
wider those limits are, the sounder the logic would appear. 

Which of these two trends contains the greater promise is an 
open question. Both are present today, and both no doubt will 
continue. The important thing is to understand them, and to see 
them as involving a quest for more intelligent and discriminating 
procedures of standardization and interpretation in the light of 
the purposes contemplated. 

NEW OPERATING CONCEPTS 


The quest for new operating concepts around which tests can 
be constructed is today being followed along two widely different 
lines. The first is the realistic study of the function to be measured 
as it appears in some actual setting, this being an extension of 
the idea of the special purpose psychological test. The second is 
the technique, or rather the body of techniques, known as factor 
analysis. 


1. The study of function 


The practice of building tests around a realistic study of the 
function to be measured is, of course, not new. But it has been 
excellently exemplified in much of the psychometric work done 
in the United States Army during World War II. It seems very 
probable that this work, like that done in World War I, will set 
the patterns for many future developments. The tests themselves, 
being intended for military uses, for the most part may not be em- 
ployed widely, as was Army Group Intelligence Examination 
Alpha. For this reason they have not been described at length in 
this book. Instances of them are considered here, not for their own 
sake, but to illustrate the basic principle of their construction, 
which is likely to find extensive application. 

A good example is the development of a test for radiotelegraph 
operators (v. Staff, Personnel Research Section, Classification and 
Replacement Branch, Adjutant General’s Office, 1943 a). A survey 
of the situation revealed that the most common cause of failure 
among radiotelegraph operators in training was the failure to 
learn code. Accordingly, two chief tests were constructed. The 
first was the Radiotelegraph Operator Aptitude Test. It consists 
of items in general like those of the Rhythm Test in the Seashore 
Measures of Musical Talent. Pairs of code patterns are presented 
aurally. The series increases in difficulty. The second member of 
each pair is to be judged the same as or different from the first. 
The second test was the Code Learning Test. This consists of a 
30-minute learning period centering upon 6 code characters. After 
this there is a test with the 6 practiced characters presented 12 
times each as part of a total set of 40, the additional 28 consisting 
of unpracticed code characters. The task is to discriminate the 
practiced characters. Both tests were found to correlate well with 
the ability to learn code up to a given speed as measured by time 
required. 

Another instance is the Aviation Cadet Qualifying Examination 
(v. Staff of the Psychological Branch, Office of the Air Surgeon, 
1945). This test was developed through a long series of refine- 
ments. A vocabulary subtest was considered but rejected. In final 
form it consists of the following subtests: Pilot Aviation Interests, 
Avocational Interests Related to Aviation, Driving Information, 
Hidden Figures (a spatial perception test), Mechanical Compre- 
hension (tracing and understanding mechanical relationships). 
The best of these, in terms of correlation with training success, 
were Mechanical Comprehension and the two interest tests. 

Another instance of test development is the construction of a 
battery for identifying potentially successful naval electrical engi- 
neers (Lawsche and Thornton). The final battery which emerged 
out of a series of experimental attempts, involving correlation 
with the criterion of achievement in the training course, consisted 
of a test of ability to read simple measurements and to solve 
arithmetical problems, another on electrical information, and a 
third revealing general alertness. The above was the order of 
predictive efficiency against the criterion of training course grade 
points. 

Even a recorded failure to construct a valid test is not without 
its instructive aspect. The attempt was made to build a test for 
the selection of truck drivers. A number of visual and motor psy- 
chophysical tests were tried out, but had little selective value in 
terms of the criterion. The same was true of an experimental 
driver-information test. The general conclusion was that the best 
test was simply a tryout on the road (v. Staff, Personnel Research 
Section, Classification and Replacement Branch, Adjutant Gen- 
eral’s Office, 1943 b). It would seem that in this case the criterion 
was so elusive, and the operative conditions of success so in- 
accessible, that no test could be built. 

The relative definiteness and immediacy of the criterion, and 
the clarity with which it was possible to formulate the basic con- 
cept embodied in the test would seem to explain two notably suc- 
cessful features of mental measurement during World War II. 
The first of these is the wide use of brief tests for screening pur- 
poses, which has been mentioned elsewhere in this book. As Guil- 
ford (1946) and others have pointed out, when a very well- 
defined process of selection was involved, or when it was desired 
to make sure that certain minimal requirements were met, con- 
ventionally accepted standards of reliability and validity became 
unnecessary. The second notable success was in the field of per- 
sonality measurement. We have seen that, with a very few excep- 
tions, personality tests have not commended themselves highly 
to psychologists for general use. But during the war they were 
found very serviceable (Hunt and Stevenson, 1946). The reason 
was the simplicity of the problem. That is to say, there was a very 
close relationship between various pathological manifestations 
and unfitness for service. Somnambulism, for example, may not 
usually be very serious in civilian life, but in war it can be disas- 
trous. So, too, for many other personality and behavior disturb- 
ances which can be indicated successfully in a relatively short 
and simple test. 

Aircraft pilot trainee selection is an outstanding example of the 
use of analytic techniques as a basis for the construction of 
effective tests. Guilford (1945) writes as follows: “It was found 
that the 21 scores offered by the classification battery measured 
only eight of the factors that appear to be positively loaded in 
the pilot-training criterion. All of these factors, incidentally, are 
foreign to the usual intelligence test. The use of intelligence tests 
for the selection of pilots among those whose I.Q.’s are above 100
would be practically futile. From the estimated factor loadings 
of these eight factors in the pilot criterion, it could be predicted 
that these factors, optimally weighted in the test composite, would 
yield a validity of about .60 for that composite. This was not far 
from the validity actually obtained. From results with experi- 
mental tests, it was estimated that there were nine other factors 
having positive loadings with the pilot criterion. Had the classifi- 
cation battery included them, properly weighted, the validity of 
the composite would have been about .70. There were four other 
factors in which the pilot criterion appeared to have very low 
negative loadings. With these factors also included and appropriately
weighted, the multiple correlation should be about .72. At
least two unknown factors that appeared to have substantial pilot 
validity were not included in these considerations. New factors 
were still undisclosed but indicated before the end of the war. 
With one or two exceptions, the 21 factors with some claim to 
recognition in the pilot criterion would ordinarily be called 
abilities. Whatever variances were contributed to the criterion by 
temperamental factors were almost untouched. The conclusion 
would be that the upper limit of validity for any battery is an
unknown quantity. Any estimate of it needs to be liberal and
subject to revision as new factors come into the picture. Incidentally,
the number of human factors, when they are much better known, 
will probably run much larger than has been supposed. The hori- 
zon of aptitudes is slowly but surely extending beyond the con- 
fines of the I.Q. It is hoped that the horizon of temperament will 
also grow beyond the concepts of neurotic tendency and the
P.Q.” (p. 434).
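
Guilford's figures can be checked with a little arithmetic. The sketch below rests on an assumption that is not stated in the quotation: that the factors are uncorrelated and are measured without error, so that the validity of an optimally weighted composite is the square root of the sum of the squared criterion loadings. On that assumption the three reported validities imply how much criterion variance each successive group of factors contributes. The function names are illustrative only.

```python
# Assumption (not from the quotation): orthogonal, perfectly measured factors,
# so composite validity R = sqrt(sum of squared criterion loadings).
validities = [
    ("8 factors in the battery", 0.60),
    ("plus 9 further positive factors", 0.70),
    ("plus 4 low negative factors", 0.72),
]

def variance_accounted(validity):
    """Proportion of criterion variance implied by a validity coefficient."""
    return validity ** 2

def increments(steps):
    """Criterion variance added at each step of Guilford's account."""
    added, previous = [], 0.0
    for label, validity in steps:
        current = variance_accounted(validity)
        added.append((label, round(current - previous, 4)))
        previous = current
    return added

for label, gain in increments(validities):
    print(f"{label}: adds {gain:.4f} of criterion variance")
```

On these assumptions even all 21 factors leave roughly half of the criterion variance unaccounted for, which fits Guilford's remark that the upper limit of validity is an unknown quantity.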

The assertion may be ventured that the outstanding single out- 
come of the experience of measurement in World War II is the 
realization that psychometric advance does not depend upon 
superficial devices or quasi-clerical improvements, but upon a 
clearer, more practicable, and better analyzed notion of what is 
to be measured. In other words, as Jenkins (q.v.) puts it in a
paper to which reference has already been made, the outstanding 
lesson is the need for better analytic techniques for validation. 


2. Factor Analysis 


Factor analysis is a technique or group of techniques which is 
coming more and more into prominence. There have been numer- 
ous references to its application in these pages, and although no 
comprehensive account is offered here, a general appraisal is in 
order. It is essentially a method of clarifying and defining basic 
concepts, not indeed inconsistent with more empirical and trial- 
and-error procedures, but introducing new controls and refine- 
ments. It will, for instance, be remembered that the California 
Test of Personality is built on what is essentially an empirical, 
though very carefully considered, classification of types of adjust- 
ment, because the factorial studies undertaken by the authors had 
not yielded any better or more fruitful foundation for test
construction. If analytic studies of personality such as those
conducted by Raymond Cattell (q.v.) should in the future give
rise to feasible working concepts, presumably they would be used 
in the construction of what might turn out to be superior per- 
sonality tests. These instances and comments are presented to 
make clear the place and significance of factor analysis in the 
general picture of psychometrics. 

A. Factor analysis is essentially a process of simplification. It is 
a mathematical technique, or array of techniques, by which a large 
heterogeneous set of measurements can be expressed in terms of 
a few simple concepts. The idea applies far beyond psychology.
For instance, in measuring and plotting the surface of the earth, 
it is possible to express what would otherwise be a baffling array of 
figures in terms of the two “factors” of latitude and longitude 
(Burt, 1944). Or again, we may take a set of anthropometric 
measurements, i.e., measurements of the human body, and show 
that all the relationships among them can be expressed in terms of 
length, girth, head size, and size of extremities (Thurstone, 1946).
Or, to take an illustration closer to our present topic, consider a 
final examination in a course in psychology. When the examination,
which consists, let us say, of a lengthy objective test, has
been scored, there is a large array of figures. How are they to be 
interpreted, if interpretation is desired? It might be argued that 
these scores express first of all some very inclusive and general 
influences, such as the student’s motivation, or his reading ability, 
or perhaps his over-all knowledge of psychology. These we might
call general factors, or even lump them all together, and call them 
the general factor in the test performance. Then there would be 
some students who would have a special knowledge of this or that 
specific psychological topic or topics, and various items calling 
for such special types of knowledge. Such knowledge would reflect 
itself in the test scores. Then, if the test were very inclusive, it
might to some extent depend on a still more special and limited 
consideration, namely, the mathematical skill possessed by a few 
of the students and required in a few of the items.* So it would 
be possible to argue that our array of test scores, and all the in- 
terrelations they display, express or involve the operation of this 
set of factors. If we were incautious we might even say that the 
test performances were caused by these factors. But this would 
be a rash statement, because possibly someone might be able to
suggest quite a different set of factors that would seem to explain 
the data just as well.

B. Factor analysis is essentially a search by statistical methods 
for the psychological processes which underlie and determine test 
performance. One might say that it is a search for what is really 
being measured in tests. This is not usually well revealed by their 
titles. There are tests, so-called, of intelligence, aptitude, talent, 
attitude, interest, personality, value, and so forth. But these terms 
and the concepts for which they stand are by no means clear. 
Nor is the matter helped very much when the test items themselves
are scrutinized. Intelligence tests often contain memory
items, verbal items, numerical items, items requiring discrimination
of space relationships. Does a single mental process exemplify
itself in all these types of performance? Is there anything
in common between verbal and arithmetical reasoning? If so,
what is it? Or between vocabulary and the power to understand
a difficult paragraph? Or between the task of telling what a sheet
of paper will look like when unfolded after having been folded
and cut, and the task of running a maze? Does the same ability
exemplify itself in arithmetical computation and in problem solving?
Does aesthetic discrimination enter in with intelligence tests
that use pictorial material? It is to such questions as these that
factor analysis is directed.

* I owe this illustration to Cronbach (1947).

The general procedure is an extension of the correlational technique.
A correlation matrix, such as that in Figure 32, is set up,
showing the intercorrelations of a number of tests. When the 
unreliability of the tests concerned is allowed for, the coefficients 
show the true relationship or “commonality” between the various
tests. Some have much in common, some have less, some have 
very little. The tests that have high intercorrelations are pre- 
sumably measuring the same thing to a considerable extent, or 
are “saturated” with the same factor to a considerable extent. 
Those with low intercorrelations embody different factors in the 
main. But all have at least something in common. On these 
assumptions mathematical analysis is applied to determine how 
many factors are needed to account for the observed relationships,
and also to determine as far as possible what these factors are. 
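
The allowance for unreliability mentioned above is commonly made with Spearman's correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below is an illustration only: the function name is ours, and the numerical values are hypothetical and are not taken from Figure 32.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores from an observed
    correlation r_xy and the reliabilities r_xx and r_yy of the two tests."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed r = .48, reliabilities .80 and .90.
print(round(correct_for_attenuation(0.48, 0.80, 0.90), 3))
```

The less reliable the tests, the larger the corrected coefficient becomes relative to the observed one; with perfectly reliable tests the two coincide.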


                   Opposites  Completion  Memory  Discrimination  Cancellation
Opposites ......      —          .80        .60        .30            .30
Completion .....     .80          —         .48        .24            .24
Memory .........     .60         .48         —         .18            .18
Discrimination .     .30         .24        .18         —             .09
Cancellation ...     .30         .24        .18        .09             —


FIG. 32. HYPOTHETICAL CORRELATION MATRIX
(Spearman, 1927) 
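
The matrix of Figure 32 is constructed so that a single common factor accounts for every coefficient. As a minimal sketch, assuming the correlations read as entered below and assuming a one-factor model in which each correlation is the product of two loadings, the loading of each test can be recovered from triads of correlations. The function and variable names are illustrative and are not Spearman's notation.

```python
from itertools import combinations

TESTS = ["Opposites", "Completion", "Memory", "Discrimination", "Cancellation"]

# Off-diagonal correlations as read from Figure 32 (upper triangle only).
R = {
    ("Opposites", "Completion"): 0.80,
    ("Opposites", "Memory"): 0.60,
    ("Opposites", "Discrimination"): 0.30,
    ("Opposites", "Cancellation"): 0.30,
    ("Completion", "Memory"): 0.48,
    ("Completion", "Discrimination"): 0.24,
    ("Completion", "Cancellation"): 0.24,
    ("Memory", "Discrimination"): 0.18,
    ("Memory", "Cancellation"): 0.18,
    ("Discrimination", "Cancellation"): 0.09,
}

def corr(a, b):
    """Look up a correlation regardless of the order of the two tests."""
    return R[(a, b)] if (a, b) in R else R[(b, a)]

def general_loading(test):
    """Estimate a test's loading on a single common factor.

    If r_ij = a_i * a_j for every pair, then a_i**2 = r_ij * r_ik / r_jk.
    The estimate is averaged over every pair (j, k) of the other tests.
    """
    others = [t for t in TESTS if t != test]
    squares = [corr(test, j) * corr(test, k) / corr(j, k)
               for j, k in combinations(others, 2)]
    return (sum(squares) / len(squares)) ** 0.5

loadings = {t: general_loading(t) for t in TESTS}

# Largest gap between an observed correlation and the product of loadings.
max_residual = max(abs(corr(a, b) - loadings[a] * loadings[b])
                   for a, b in combinations(TESTS, 2))

for test in TESTS:
    print(f"{test:15s} {loadings[test]:.2f}")
print(f"largest residual: {max_residual:.6f}")
```

With these figures the residuals vanish, which is what it means to say that one factor suffices. Real correlation matrices never fit so exactly, and the size of the residuals is then the evidence for or against further factors.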
Putting the matter more generally still, any correlation matrix
can be factored. This is not a question of psychology at all, but
of mathematics. So it is possible to take a set of correlations of 
test scores and express them in terms of relationships between a 
much smaller number of formulations which are the factors of the 
matrix. If we simply have masses of raw test scores, they tell us
very little. If we take those derived from an intelligence test, and 
those derived from a mechanical aptitude test, and correlate them, 
we have greatly simplified our situation and have obtained a 
statistic that may show us something quite important in our data. 
If, now, we have a whole array of correlations among test scores 
and reduce them to the interrelationships of a small number of 
factors, exactly the same process of simplification and clarification 
has gone on. Once more we have statistics that make our data 
more intelligible and more manageable. This is the central notion 
involved in factor analysis. 

A number of schools, sub-schools, and, one might almost say, 
sects have grown up under the heading of factor analysis, which 
exemplify various interpretations and methods of operation. Only 
a few of them will be mentioned here. 

(a) The outstanding characteristic of the view of Spearman 
(1927 a, b, 1939), who may be considered the originator of factor
analysis, is that a large number of different kinds of tests involve 


sites, arithmetical reasoning, vocabulary, classification, etc. The 
third cannot. Thus we cannot at present measure all aspects of G,
though we can measure enough for significant practical purposes.
The general factor may also be thought of as intellective energy, 
i.e., what Gestalt psychologists call Gestaltungskraft, the mental 
drive which integrates and structuralizes a psychological field. 
Spearman regards this as in substance a proof that completions,
arithmetic, vocabulary, and directions exemplify a general factor 
(v. Spearman, 1927 b). 

Besides, the general factor involves factors of two other kinds. 
The first are those unique to it, i.e., the “special” factors peculiar 
to solving geometrical problems, or to understanding prose, but 
not common to both. Second are the “group” factors common to 
a number of performances, but not so universal as G. For instance, 
there is much in common between cancelling A's and E's in a 
prose passage that would not appear in handling geometrical 
problems. Thus, for Spearman all mental performance is due to 
the general factor, certain special factors, and certain group fac- 
tors. Such, according to him, is the organization of intellect. 

(b) The view most sharply opposed to this is that of Thurstone. 
In his monumental work published in 1938, he gave over fifty 
tests to some 250 undergraduates, and computed about 1,500 
correlations. These correlations he explained in terms of a number 
of more or less independent factors which he called primary 
mental abilities, among them being number facility, word fluency, 
visualization of space, memory for words, perceptual speed, verbal
reasoning, and induction. In his later work he has modified this 
list somewhat and further defined some of the primary abilities 
(v. Thurstone and Thurstone, 1941; Thurstone, 1940). Another 
set of factors includes a general factor, a mathematical-mechanical 
factor, a verbality factor, a spatial relations factor, a memory 
factor, a mental speed factor, a deductive reasoning factor, and 
a motor speed factor (Holzinger). In Thurstone's earlier work the 
general factor, or G, does not appear. Some years ago, however, 
Spearman (1939) argued that this is due to his statistical proce- 
dures, and that if Thurstone’s data are handled by a different 
method, a true general factor appears in them. Since then Thurstone
(1944) has published work demonstrating what are known
as second-order factors. First-order factors are derived from and
involved in the test correlations themselves. Second-order factors
are derived from and involved in the first-order factors. Thurstone
provides an interesting and instructive illustration of the
idea. Let us suppose, he says, that we have a number of rectangular
boxes of many different sizes, and also a set of measurements
of each—the diagonal of the front, the area of the top, the length 
of the vertical edge, etc. These measurements would correspond to 
the test scores, and each box to an individual for whom scores 
had been obtained. When the various sets of scores had been
correlated, and the coefficients arranged as a matrix, this matrix 
could be factored into three primary factors, namely, length, 
width, and height. Then these three primary factors could again 
be factored into a single secondary factor, which would be called 
the size factor. Thurstone has expressed the view that what Spear- 
man called the general factor may reappear as a secondary 
factor. 
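
The second step of the box illustration can be sketched numerically. The example below does not use Thurstone's data and skips the first-order analysis altogether: it starts from the three primary dimensions themselves, in a hypothetical population of boxes whose logarithmic length, width, and height each combine a shared "size" component with independent variation of their own. Because the three primaries are then correlated, their correlations can in turn be accounted for by one second-order factor. All names and parameter values are assumptions made for illustration.

```python
import math
import random

def pearson(x, y):
    """Product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_log_dimensions(n_boxes=2000, seed=0):
    """Hypothetical boxes: each log-dimension = shared log-size + own variation."""
    rng = random.Random(seed)
    dims = {"length": [], "width": [], "height": []}
    for _ in range(n_boxes):
        log_size = rng.gauss(0.0, 1.0)  # common to all three dimensions
        for name in dims:
            dims[name].append(log_size + rng.gauss(0.0, 1.0))
    return dims

dims = simulate_log_dimensions()
r_lw = pearson(dims["length"], dims["width"])
r_lh = pearson(dims["length"], dims["height"])
r_wh = pearson(dims["width"], dims["height"])

# With three variables a one-factor solution is exactly determined:
# loading_i**2 = r_ij * r_ik / r_jk.
size_loadings = {
    "length": math.sqrt(r_lw * r_lh / r_wh),
    "width": math.sqrt(r_lw * r_wh / r_lh),
    "height": math.sqrt(r_lh * r_wh / r_lw),
}

for name, value in size_loadings.items():
    print(f"{name:7s} loading on size: {value:.2f}")
```

If the primary dimensions were uncorrelated across boxes there would be nothing for a second-order factor to explain; a general factor can emerge at the second order only because the primaries are themselves correlated.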

(c) According to a third viewpoint, mental functioning is the 
result of innumerable shifting influences rather than a few definable
and permanent factors (Thomson, 1939). Akin to this is
the view of Thorndike (1927) who regards association or “con- 
nection forming” as the basic mental process, so that the higher 
mental activities are those which involve larger numbers of con- 
nections. Thorndike's work, it should perhaps be said, is not 
usually considered as belonging to factor analysis in the strict 
sense. This last position is by no means so far removed from that 
of Spearman as it might seem, for the general factor, or G, might 
be thought of as an aggregate or class of special connections or 
processes rather than as a sort of psychological thing in itself.

It would seem, then, that the three major viewpoints in this 
field are, in certain respects, though not in all, not wholly
incompatible. We must, however, remark that this brief sketch
gives a very inadequate notion of factor analysis. Its techniques 
are being widely and actively applied. New results are constantly 
being reported. Leading workers change their opinions. But whatever
the account offered, the thing that should be understood is
that the aim and end of analysis is first to determine what the 
basic factors are, then to determine the relative “loadings” of the 
various tests used with the factors arrived at, i.e., the proportion 
of test performance due to each factor, and finally to reconstruct 
the tests in order to make them, so far as possible, “factorially
pure.”

C. The practical outcome of factor analysis is to provide new 
and, it is hoped, better operating concepts for test construction. 
Factor analysts criticize such tests as the Binet as composite 
jumbles of all kinds of tasks assembled ad hoc without any clear 
rationale or organizing idea. The hope is to be able to build tests 
whose psychological content will be clear and definite. 

The best present instance is the Chicago Tests of Primary 
Mental Abilities, which involves six factors: Number ability
revealed in rapid and accurate calculation, Verbal ability revealed
in verbal comprehension, Spatial ability revealed in the imaginary 
manipulation of spatial forms, Word fluency revealed in produc- 
ing numerous words in a given time, Reasoning ability revealed 
in finding relationships in presented material, and Memory ability 
revealed in rote memory. How these constituent factors are set 
up in the tests will be seen in Figure 22. This is the outcome of
the continuing work of Thurstone, which led to the conclusion 
that the six factors named were clearly enough isolated to be used 
in test construction (Thurstone and Thurstone, 1941). Another 
practical example is the work of Flanagan (q.v.) with the Bernreuter
Personality Inventory. He found it possible to reduce the
original four trait divisions (Neurotic tendency, Introversion-extraversion,
Dominance-submission, Self-sufficiency) into two,
namely, sociality and self-confidence. These have been built into 
the scoring key, and can be used in place of the other four. This 
is an excellent instance of what is involved in factor analysis. 
Yet another example is the application of factor analysis to the 
Strong Vocational Interest Blank for Men. This yielded five basic 
interest types, to wit: Interest in people, Business, Intellectual 
activities, Science, and Language (Strong, 1934, 1943; see also 
Thurstone, 1937). Yet another instance is the California Tests of 
Mental Maturity, which purport to measure Memory, Spatial rela- 
tionships, and Reasoning. These are said to have been arrived at 
by means of factor analysis, and to be the basic psychological 
components of the battery. Spearman has not undertaken to con- 
struct a test battery for the measurement of G, so nobody can say 
what it would be like. However, Stoddard (1943) has suggested
that it might well resemble the Stanford-Binet scale with certain 
modifications. 

For a general appraisal of the over-all significance of factor
analysis, the most crucial question that can be asked is this: Just
what are the factors that analysis discovers? Specifically, are they
akin to actual psychological entities, or causes, or forces, as 
faculties and instincts were once supposed to be? That a correla- 
tion matrix can be factored is a mathematical truth. But this in 
itself tells us nothing about what the obtained factors correspond 
to. This is a great deal more than a purely theoretical question. It 
so happens that some counselors administer a factorial test, such 
as the Thurstone Primary Mental Abilities battery, work out the 
indicated factor profile, and then give very confident vocational 
advice on the result, apparently in the belief that they are
proceeding on a “scientific basis.”

There is no doubt that factor analysis has at least a superficial 
resemblance to a return to the older faculty psychology. Burt 
(1944) has remarked that many of the factors that are being 
announced are given almost exactly the same names that Gall long 
ago attributed to his mental faculties. Also, some very distin- 
guished workers in the field have in the past declared, almost in 
so many words, that factors are mental faculties under another 
name, the only difference being that faculties were arrived at a 
priori while factors are arrived at by dint of statistical analysis. 

Yet such a viewpoint seems hard to defend. As to the predictive 
value of a factor pattern, it has certainly not uniformly or uni- 
versally proved of significance. In discussing the Thurstone 
Primary Mental Abilities battery, and other similar instruments, 
it was pointed out that they do not seem any more closely related 
to the usual criteria, e.g., success in school, than the familiar 
“global” tests. So, too, Goodman (q.v.) and Ellison and Edgerton
(q.v.) find little relationship between factor scores and achievement
in school subjects that might be thought likely to be associated
with them. Also, Stuit and Hudson (q.v.) and Adkins (q.v.)
report that the factor patterns revealed by the Primary Mental 
Abilities battery have little relationship to vocational fitness and 
choice. If the factors that are being discovered really were basic 
operational constituents of the human mind—if they were indeed 
genuine “faculties”—then surely an individual's factor pattern 
would have a decisively greater predictive significance than his 
global score, let us say, on the Wechsler-Bellevue Test. 

Moreover, the point is often made that the factors involved in 
many of the studies are derived by an analysis of test responses. 
But such responses are themselves limited, and it is highly prob- 
able that much in the human mentality—inventiveness, creative- 
ness, initiative, etc.—may not appear in them at all, or only very 
meagerly. 

What, then, it may be asked, is the value of factor analysis? 
The answer does not seem difficult. Any legitimate simplification, 
any rationalization, any ordering of a field of complex data has 
many advantages, both theoretical and practical. When we derive 
a measure of central tendency, or a measure of relationship, we 
find out something valuable about our data, considered as a field 
of order. When we show that our data can be handled in terms of 
a few clear-cut concepts, something of great importance has been
achieved. We need not believe that an average “exists,” or that a 
correlation “exists.” They are merely conceptual tools. But this 
does not detract from their importance both for thought and practice.
So also with mental factors. As Burt (1947, p. 97) puts it:
“What distinguishes factor analysis, therefore, from other ways of 
discovering how individuals and their numerous attributes can best 
be classified, is chiefly this: whereas the ancient logician reached 
his definitions by examining the meanings of words, the modern 
factorist reaches his classifications by examining the correlations 
between forms of behavior to which these words very loosely refer. 
But the ulterior object is still the same; and, whether we are 
describing persons or traits, the factorial concepts adopted are 
simply principles of classification.” 


TAPPING NEW PSYCHOLOGICAL PROCESSES


A development of major importance in recent years has been 
the increasing use and investigation of projective tests, which are 
directed towards psychological processes touched only slightly 
and unsatisfactorily or not at all by psychometric instruments. 

Students of projective testing tend to express themselves in a 
very polemical fashion, to make extremely sweeping claims, and 
to cast much disparagement upon psychometric methods generally.
Thus Klopfer and Kelley (q.v., p. 13) say: “Out of the need
to bridge the gap between merely subjective ‘understanding’ of 
another personality gained through clinical observation, and the 
objective measurement of individual differences with little or no 
understanding of their origin or deeper meaning, there developed 
a new approach which may be described as in the above quota- 
tion, by the term ‘projective methods of personality diagnosis’.” 
Rapaport, Gill, and Schaefer (q.v.) intimate in effect that mental
testing has been held in a strait jacket largely by the prestige 
of the Binet scale and its revisions, and contrast the work of the 
“I.Q. testers” with projective procedures to the disadvantage of
the former. And Sargeant (q.v.) finds in the new development nothing
less than a transformation from mechanistic to dynamic and
holistic psychological interpretations. Perhaps such extreme contentions
are understandable in workers in a relatively strange and
novel field which has not yet gained complete recognition and is
subject to many misunderstandings. How justifiable they are
we shall have to inquire after considering three of the most important
projective instruments. But there is no doubt that projective
methods are highly significant in opening up for investigation 
areas of mental life with which psychometric methods have not 
adequately dealt. 

As to the concept of projection, Rapaport, Gill, and Schaefer 
write: “In this sense a projection has occurred when the psychological
structure of the subject becomes palpable in his actions,
choices, products, and creations” (Vol. II, p. 7). The human 
mind, they point out, does not merely receive impressions, but 
always reacts to them in terms of its own characteristics. Thus
any reaction, even that of simple perception or association, is 
indicative of the personality, and may be considered a projection 
of it. Any reaction, that is, is determined not only by the object 
reacted to, but by the subject who reacts, and the characteristics 
of the subject are more or less revealed in it. The creative work 
of an artist is a projection and revelation of himself. So are the 
responses of a subject when he is asked to give free associations 
to a list of stimulus words, or to tell what story is suggested to 
him by a picture, or to say what he sees in cloud shapes or ink 
blots. This idea is the working basis of projective testing. 

Projection takes place in a vast number of varied situations, a 
great many of which have been used, with varying degrees of success,
for clinical and diagnostic purposes. J. E. Bell (q.v.) in his
extremely thorough account of the field lists storytelling, the in- 
terpretation of cloud pictures, the expression of likes and dislikes 
for photographs of faces, the analysis of handwriting, drawing, 
and painting, finger painting, picture completion, and vocal expression
among some of its manifestations which have been found
significant. All this in addition to the well-known and widely used 
projective tests. 

The instruments themselves are techniques for eliciting, observing,
recording, and interpreting projective responses. A stimulus
situation is set up which is as economical as possible in the sense 
of not being time-consuming, as impersonal as possible, standard
for all subjects, and limited to one segment of behavior. The 
examiner observes and records the subject’s responses, and inter- 
prets them partly in the light of his clinical experience and partly 
with reference to an organized body of interpretive data. Three 
such instruments will now be briefly described. 
1. Word Association Test *


This is the oldest of well-known projective tests. The first 
extensive report of its uses was made by Jung in 1910. It consists
of a list of stimulus words which are presented orally to the
subject, with the instruction to respond by saying the first word 
that occurs to him. Jung originally used a list of 100 stimulus 
words. This list was revised by Rosanoff. Rapaport, Gill, and 
Schaefer have made a further revision, reducing the number to 
60. The basis of choice was in favor of words that would tap many 
areas of ideation, conflict, and maladjustment. Many of the words 
have familial, domestic, oral, anal, aggressive, and sexual con- 
notations. Also, many of them are nouns. The content and speed 
and general emotive characteristics of the subject's responses to 
the words are the indications upon which interpretations are built. 
Thus, if the response to “father” is “tyrant,” it would, taken 
together with other indications, be considered significant. If reaction
is very slow and difficult, or if the subject can think of no
word at all, these are considered signs of emotional disturbance.
Close associations such as “house—my house” are set off against 
distant associations such as “lamp—turkey.” Such signs, in con- 
junction with the whole clinical and personality picture, are used 
as diagnostic criteria. Out of experience in the use of this test 
there has been built up a considerable body of interpretive ma- 
terial on which the examiner can draw to assist his diagnosis. 


2. Thematic Apperception Test †

The material for this test consists of three series of ten pictures 
each. One series is for both men and women, one is for women, 
one is for men. Most of the pictures show human beings in various 
attitudes and relationships—approaching one another from a distance,
looking out of the window, etc.—or with marked facial
expressions. The subject looks at the pictures one by one. He is
asked to tell the examiner what the situation represented is, what 
events led up to it, what outcome is probable, what the thoughts 
and feelings of the characters are. The examiner makes a written 
record, as far as possible complete, of what is said. Speed and 
readiness of response, content of response, misrecognitions of pictured
objects, and the aspect or portion of the picture on which
the subject centers his attention are among the indicative signs.
Thus, if a response to a picture of a boy sitting with a violin
before him on the table is that he is being tyrannized over by his
parents and will soon capitulate although unhappy and rebellious,
this begins to suggest a certain personality orientation. In short,
the “story” in all its aspects which the subject produces is treated
as a projective manifestation, and so is failure to produce any
“story” response at all. As with the Word Association Test, a
body of interpretive material has been assembled and analyzed
for the assistance of the examiner in arriving at his diagnosis.

* References: Rosanoff; Rapaport, Gill, and Schaefer; Jung.

† References: Murray, 1938, 1942.


3. The Rorschach Test ‡


This is the best-known and most important of all existing pro- 
jective tests. Particularly since the Rorschach Research Exchange 
(q.v.) began periodical publication in 1936, very wide discussions
and investigations of the test have been conducted, and enormous 
amounts of interpretive material have been assembled. 

The test itself consists of 10 large ink blots, 5 in different shades 
of gray, 2 in gray with one shade of red, 3 entirely or almost 
entirely in color. They are printed on separate cards and pre- 
sented one by one to the subject, who is asked: “What could that 
be?” or “What do you see here?” It is necessary to conduct the 
test in a quiet room, engaging the subject’s full attention. Tech- 
niques for group administration have been developed, with the 
ink blots thrown on a screen (Harrower-Erickson; Harrower- 
Erickson and Steiner; Munroe, 1942). Apparently group admin- 
istration is fairly satisfactory, although there are some doubts. 

The subject, of course, makes a verbal response, telling what 
he sees in the ink blot. Each response is scored with reference to 
5 categories.* These have to do, not with the direct content of 
what the subject sees, but with the mode of his seeing. (1) The
first has to do with the “location,” or “area” or apperceptive mode 
of the response. It may be to the pattern as a whole, to its large 
detail, or to its small detail. (2) The second category has to do 
with the content of the response, not directly but as manifesting 

a certain type. A response may involve human figures, animal 
concepts, nature and geography, plants and botany, art concepts 
and abstract concepts. (3) The third category has to do with the 
fT References: Klopfer and Kelley; Beck. 
* There is considerable variation among scoring plans. A brief statement such 


as this cannot take cognizance of them all. The present account is based upon 
Beck, Klopfer and Kelley, and Rapaport, Gill and Schaefer. 
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determinants of the response, i.e., the elements in the stimulus 
which are prepotent. The response may involve human-like, 
animal-like, or minor movement. It may be to the shading values, 
the color values, or the form of the ink blot. (4) The fourth 
category has to do with the form level of the response. In con- 
ception the response may range from vague and arbitrary to 
accurate and definite. (5) The fifth category has to do with the 
popularity of the response. It may range from very common and 
typica! responses often made to very rare and idiosyncratic. De- 
tail, for instance, may be “normal” or “unusual.” In addition to 
ratings on these five categories, the total number of responses is 
noted. Symbols are used by the examiner in classifying each 
reaction on the system described. Interpretations are based upon 
these categorized scores. 
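
The five-category scheme just described can be pictured as a simple record kept for each response. The following is only a minimal sketch for illustration: the class name, field names, and category labels are assumptions of this example and are not the symbols used by Beck, Klopfer and Kelley, or any other published scoring plan.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResponse:
    """One response, scored on the five categories (labels are hypothetical)."""
    location: str     # "whole", "normal_detail", or "unusual_detail"
    content: str      # e.g. "human", "animal", "nature", "plant", "art", "abstract"
    determinant: str  # e.g. "movement", "shading", "color", "form"
    form_level: str   # from "vague" to "definite"
    popularity: str   # "common" or "rare"


def summarize(responses):
    """Return the total number of responses and tallies for two categories."""
    return {
        "total": len(responses),
        "location": Counter(r.location for r in responses),
        "determinant": Counter(r.determinant for r in responses),
    }


# A made-up protocol of three responses, for illustration only.
protocol = [
    ScoredResponse("whole", "animal", "form", "definite", "common"),
    ScoredResponse("normal_detail", "human", "movement", "definite", "common"),
    ScoredResponse("normal_detail", "nature", "color", "vague", "rare"),
]
summary = summarize(protocol)
print(summary["total"], dict(summary["location"]))
```

Such a tally only organizes the examiner's classifications. It does not interpret them, and the interpretation remains a matter of clinical judgment.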

Some idea may be given of the interpretations based on the 
scores. According to Beck (q.v.) a normal individual may be 
expected to give 31 responses in all. Of these, 6 will be to whole 
patterns, 21 to normal or ordinarily selected detail, and 4 to 
unusual detail. Deviations in the direction of more whole-wise 
responses may suggest a tendency towards broad generalization 
and survey. When extreme, they may suggest an expansive 
personality neglectful of detail. Deviation towards more detail 
responses indicates a personality type likely to attend to concrete 
matters and to approach problems practically. When extreme 
it suggests pedantry, meticulousness, and overcaution. Feeble- 
minded persons usually cannot see the blots as meaningful wholes, 
and tend to interpret the parts in a stereotyped and obvious 
fashion in terms of common objects. Interpretations in terms of 
imagined movement may indicate fantasy, delusion, or 
inventiveness. Emphasis on color may suggest impulsiveness, 
egocentricity, and emotionalism. Emphasis on form indicates 
intellectualism, steadiness, and perhaps introversion. The 
prevailing responses of the normal well-adjusted person are in 
terms of form, though color and shading are usually mentioned. 
Repetition of the same response to different cards may indicate 
stereotypy as in feeble-mindedness. The ratio of usual to unusual 
responses is often an indicator of originality. This may give the 
reader at least some general notion of the type of interpretation 
yielded by this very important test. As an additional instructive 
and interesting illustration, Bleuler and Bleuler (q.v.) gave the 
Rorschach Test to 29 Moroccan peasants, and compared their 
responses to those of normal European adults. The Moroccans 
gave few integrated responses. They produced fantastic 
interpretations of separate details of the patterns. There were few 
abstractive generalizations. Qualitatively, their responses were 
similar to those of Europeans when made to shape only, to color 
only, and in a few other respects. The authors find these test 
responses compatible with the general mode of life of these people. 
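
The comparison with Beck's expected distribution of location scores (6 whole, 21 normal detail, and 4 unusual detail, which together make 31 responses) can be sketched as a small calculation. This is a hedged illustration only: the function name and category labels are assumptions of the example, and the sign of a deviation is merely a starting point for the examiner, not an automatic interpretation.

```python
# Beck's expected counts as quoted in the text; the dictionary keys are
# hypothetical labels chosen for this sketch.
BECK_EXPECTED = {"whole": 6, "normal_detail": 21, "unusual_detail": 4}


def location_deviation(observed):
    """Return observed-minus-expected proportion for each location category."""
    expected_total = sum(BECK_EXPECTED.values())
    observed_total = sum(observed.get(k, 0) for k in BECK_EXPECTED)
    if observed_total == 0:
        raise ValueError("no responses to compare")
    return {
        k: observed.get(k, 0) / observed_total - BECK_EXPECTED[k] / expected_total
        for k in BECK_EXPECTED
    }


# A made-up record with an excess of whole-pattern responses.
deviation = location_deviation(
    {"whole": 12, "normal_detail": 16, "unusual_detail": 3}
)
for category, value in deviation.items():
    print(f"{category}: {value:+.3f}")
```

A positive value for "whole" in such a record corresponds to the deviation "in the direction of more whole-wise responses" mentioned above. What it signifies for the individual still depends on the total picture.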

Since the Rorschach Test has become widely used in America, 
the question of standardizing it has arisen. Rorschach himself 
was opposed to this, and Rapaport (1939) is critical of the 
suggestion. Some attempts, however, have been made. Hertz (1935), 
for example, has worked out norms on 300 adolescents in junior 
high school for the various Rorschach categories, and compared 
these to norms obtained with other groups. Three years later 
(1938) she published scoring lists for the Normal Detail category, 
worked out statistically on the assumption of normal distribution, 
instead of being accumulated from clinical experience, as with 
other lists. But such work has been scanty, and has been criticized 
in principle by many experts. 


Psychotic and neurotic groups and a normal control group. To 
mention another instance, Munroe (1943) found that the test 


The question of validity cannot be other than paramount, but 
those having to do with standardization and reliability involve a 
debate that goes to the heart of projective testing. Should 
projective tests be standardized? Should definite norms be worked 
out which would more or less “automatically” interpret the varied 
and multiple responses elicited? It is very doubtful. Interpretation 
depends upon the total picture, which must be assessed by 
the wisdom and experience of the examiner. The standardization 
group technique would seem legitimate only where the test 
responses themselves are channelized. So also with reliability: this 
may not have the same value in projective as in psychometric 
testing. As we have seen, it pertains not to the test alone, but also 
to the examiner, the subject, and the total setting. It depends 
largely on channelization, and this the projective tests avoid as 
far as may be. What in a psychometric test would be variable 
errors, here may be important indicators which ought to be 
heeded. The more freely, within limits, the subject responds, the 
better the chance of the examiner to reach a significant 
understanding. This may even mean that the reorganization of the 
Rorschach Test for group use is a mistaken development. 
Projective testing should be free to develop its own techniques and 
procedures, and if it becomes assimilated to psychometric testing, 
there is danger that its distinctive values will be destroyed and 
its potential contribution lost. There is, of course, a complementary 
danger, for if projective testing pushes ahead without any 
sort of adequate controls, which is entirely possible, all kinds of 
worthless and trashy instruments will be produced, and the most 
fantastic interpretations will be broadcast without let or 
hindrance. Apparently projective testing in the immediate future 
must steer a course between charlatanry and an alien 
pseudo-scientific rigidity. 

The projective tests are explicitly based upon a dynamic or 
holistic psychology. Perception, it is pointed out, is not determined 
simply by the impact of external objects upon the sense organs. 
Rather it is an interpretive and purposive reorganization touched 
off by such impacts. In the same way association is not the 
mechanical establishment or grinding in of certain connections. 
Rather it is again a process of purposive, meaningful organization.* 
What a person perceives and how he perceives it, what experiences 
he associates together and how he associates them, depend upon 
his mental or personality organization as an organic whole. The 
early psychological interpretations of the Word Association Test 
were not explicitly founded on this viewpoint, although they 
implied it. Jung, at first at any rate, tended to think of the test as a 
means of discovering the subject’s residual or active associations, 
which were related to and indicative of emotionally potent 
experiences. But today students of projective testing think of any 
response as resulting from and thus manifesting not some isolated 
experience or complex, but the total personality organization. 

* See Wolfgang Koehler, Principles of Gestalt Psychology. 

The tests described are simply devices for eliciting projective 
and indicative responses which are practically manageable and 
scorable. The difference between the Rorschach Test and the 
Thematic Apperception Test lies in the degree of explicit struc- 
turalization in the stimulus. The latter controls and channelizes 
response more than the former, and is more apt to elicit fully 
conscious and relatively superficial constructions. The same limi- 
tation also applies to the Word Association Test. A limitation 
which applies to all three is the very artificial character of the 
stimulus situations. They are not substitutes for the study of the 
subject’s behavior and of its projective manifestations in wider 
and more normal settings. A man’s reaction to ink blots may to 
some extent show what sort of person he is, and it has the 
advantage of being convenient and scorable. But it is of course a small 
sample, and may be a distorted sample of his much more revealing 
reaction to the concerns of daily life. 

So far as psychological assumptions and orientation go, the 
chief difference between projective and psychometric tests is that 
in the former these are explicit and that in the latter they are not. 
A psychometric test is committed to measurement. A projective 
test is committed to diagnosis. To some extent the two can be 
combined, for there are projective elements in many psychometric 
tests, and psychometric elements in projective tests. But an 
attempt fully to reali 

Thus psychometric tests are by no manner of means committed 
to a mechanistic psychology. They are merely committed to good 
and workable analyzing concepts. In terms of these concepts they 
yield measurements which are reasonably definite, which is no 
small advantage and which obviously can be and should be inter- 
preted in terms of the total personality and setting of the subject. 
Projective testing is no Copernican revolution in mental measure- 
ment, as some would appear to suppose. But it is a valuable new 
development, whose future authenticity will depend upon avoid- 
ance of the twin dangers noted above. Measurement and diagnosis 
are the two aims of all mental testing. They are by no means 
mutually independent, and future progress will turn upon the 
development of better instruments for their achievement. 

measures); Chapter 16, “The nature of mental ability” (a discussion 
of factor analysis). 

Philip E. Vernon, The measurement of abilities (London: Univer- 
sity of London Press, Ltd., 1940), Chapter 8, “Analysis of abilities.” 
An exceptionally clear treatment of factor theories. 

test results,” Journal of educational psychology, 16 (1925), 599-618. 
An analysis of the I.Q. and the C.I. 

Bruno Klopfer and Douglas McGlashan Kelley, The Rorschach 
technique (Yonkers: World Book Co., 1942), Chapter 1, “History of 
the Rorschach method”; Chapter 2, “Methodological problems.” 
A systematic and historical account of the general aspects of projective 
testing. 

Helen Sargeant, “Projective methods: Their origin, theory, and ap- 
plication in personality research,” Psychological bulletin, 42 (1945), 
257-93. An over-all summary and documentation. 


QUESTIONS FOR DISCUSSION 

1. Would it be true to say that unless the I.Q. were a fairly stable 
measure, the problem of the constancy of obtained I.Q.’s could not 
even be approached? 

2. What is the advantage of using purely statistical norms and 
measures? Are such norms and measures without psychological 
reference? 

3. In what respect are the problems in applied psychology in making 
a special purpose test different from those in making a general 
intelligence test? 

4. Discuss the advantages and disadvantages of standardization on 
functional groups rather than on unselected populations. 

5. If you agree with Terman that the I.Q. involves no assumptions 
regarding the form of the growth curve, does this mean that the rate 
of mental growth does not affect an actual obtained I.Q.? 

6. Can you think of any other vocations or activities, besides truck- 
driving aptitude, which might be difficult or impossible to determine 
by tests? Why does such difficulty arise? 

7. Have you ever heard or read popular psychological discussions 
which seem to identify the sort of entities or concepts discovered in 
factor analysis? 

8. Should the apparent reasonableness on general grounds of any 
given factor theory affect its acceptability? 

9. Do we ever place any reliance on projective manifestations in 
judging people in everyday life? 

10. Does the admittedly analytic character of psychometric tests 
really imply a far-reaching psychological orientation? 

254 

Altitude, 80, 82, 90, 92, 137, 139, 
410-11 

American Anthropological Associa- 
tion, 333-34 

American Council on Education 
Psychological Examination for 
College Freshmen, 49, 155, 158, 
164-65, 202, 206, 207, 218, 243, 
325, 400 

American Council on Education 
Psychological Examination for 
High School Students, 165, 312, 
318 


Appreciations, 15 


Aptitude, 225-26, 230, 246, 248-49 

Aptitude tests, Ch. VII 

Area, 35, 83, 137, 139 

Army General Classification Test, 
198, 204-5 

Army Group Intelligence Examina- 
tion Alpha, 2, 42, 49, 61-62, 
63, 143-44, 145, 147-51, 163, 
201-2, 221, 231, 329, 330, 347, 
348, 354-55, 403; First Nebraska 
Revision, 143; Modified Alpha, 
144; Schrammel-Brannan Revi- 
sion, 143-44; Scrambled Alpha, 
144 

Army Group Intelligence Examina- 
tion Beta, 143, 144-45, 147, 175, 
330; Revised, 145 

Army Individual Test of General 
Mental Ability, 205 

Army testing, 152, 154. See also 
World War I, World War II 

Arrest, age of, 107, 108, 144, 353- 
54, 391 

Arthur Scale of Performance Tests, 
170-71, 219 

Association, 412 

Atomistic Psychology. See Mechan- 
istic Psychology 

Attitude, 257-58, 286; intensity of, 
285; measures of, 283 ff. 

Attitude scales, generalized, 285-86; 
specific, 283-85 

Aviation cadet qualifying examina- 
tion, 405 


Bernreuter Personality Inventory, 
263-66, 267, 399-400, 413 

Binet scale, 10, 38, 39, 40, 43, 61, 
84, 93, 126, 140, 412 

Binet tests, 74-75 

Binet's practice, 97 ff., 118 ff., 120 ff., 
137 ff. 

Bogardus Fatigue Test, 228 

Briggs Analogies Test, 245 

Brightness, 106-7 

British Mental Deficiency Act, 380 

Brown Spool-Packer Test, 228 


C score, 178-80 
California First Year Mental Scale, 
96, 185, 190-91 

California Preschool Mental Scale, 
184-85, 192 

California Test of Personality, 261- 
62, 407 

California Tests of Mental Matur- 
ity, 36, 178, 207, 213-15, 400, 
413 

Canal-boat children, 301, 304 

Catch questions, 48 

Census groups, 333, 403 

Chance, 48 

Character, measures of, 280 ff. 

Character Education Inquiry, 289 

Chicago Non-Verbal Examination, 
172 

Chicago Tests of Primary Mental 
Abilities, 208-12, 400, 412-13, 
414 

Choice, 14-15 

Chronological age, 106-7, 195, 200 

Church, attitude toward, 283-84 

Clapp-Young Self-Marking Device, 
154 

Clinical value, 172 

Coefficient of intelligence, 394-95 

Commitment, institutional, 43, 135, 
136, 198 

Community setting, 298-301 

Completion tests, 6 

Complex learnings, 14 

Complexity, 81 

Concentration, 81 

Concept, working, 24-25, 26, 34, 
35-37, 41, 61, 69, 72-77, 9%, 
102-3, 190, 218, 226, 239, 254- 
55, 259-60, 272, 280, 287, 291- 
92, 375-76, 405-6 

Concepts, formulation of, 14; new 
working, 404 ff. 

Conduct, 290 

Configurational psychology, 23, 25- 
26, 74-75, 80, 93, 410 

Constancy of mental traits, 339 ff., 
374. See also I.Q., constancy of 

Cooperative tendencies, 289 

Cornell-Coxe Performance Ability 
Scale, 171-72 

Correlation matrix, 409-10 

Creative output, 359 

Criteria, 39-45, 198 

Cultural factors, 330 

Cumulative effects, 303-4, 311, 312, 
316-17, 322 

Curriculum, vital, 324 

Curtis Test of Arithmetic Achieve- 
ment, 332 


Delinquency, 384 

Detroit Clerical Aptitudes Examina- 
tion, 238 

Detroit General Aptitudes Examina- 
tion, 237, 255, 399-400 

Detroit Mechanical Aptitudes Ex- 
amination, 238 

Detroit Scale for Diagnosis of Be- 
havior Problems, 269-70 

Detroit Tests of Learning Aptitude, 
136-37, 140, 400 

Developmental Examination, 188- 
90, 194 

Developmental Schedules, 188-89, 
351 

Developmental score, 192-93 

Diagnosis, 43, 115, 128-29, 203, 
268-69, 420 

Difficulty, 80-81, 82, 83, 137, 216- 
17 

Distribution, of mental traits, 
364 ff.; of scores, 220-21 

Drake Test of Musical Memory, 61, 
250 

Drawing-a-Man Test, 59-60, 174- 
75, 300, 355 


Economy, 81 


Educational achievement, 116-18, 
221, 222, 239, 241-42, 242-43, 
315 


Educational promise, 382 

Educational tests, 4-7. See also 
Achievement tests 

Efficiency, 215-16 

Electrical engineers, 405 

Emotional blockages, 81 

Emotional disturbances, 99, 112, 172 

Environment, 49, 150, 174-75, 195, 
304, 311, 312-13, 314, 328, 333, 
340, 347, 350, 353; analysis of, 
376-77 

Equality of units, 138 

Equivalence, rational, 52-54 

Error, constant, 29-30, 45; of in- 
terpretation, 32-33; personal, 
31-32; variable, 30-37, 46 

Estimating intelligence, 90-92 

Experiment, psychological, 2-4 


Face validity, 44 

Factor analysis, 3-4, 22, 26, 36, 39, 
77, 93, 115, 137, 194, 199, 209- 
12, 213, 217, 229, 235, 258, 262, 
265, 275, 400, 407-15 

Factorial purity, 26, 115-16, 117- 
18, 217 

Factorial validity. See Validity, fac- 
torial 

Factors, 3-4, 26, 36, 72, 87, 147, 
166, 212, 213-15, 406-7, 413-15; 
group, 411; second order, 411-12; 
special, 411 

Faculties, 73, 75, 225, 414 

Faculty psychology, 26, 93 

Familiarity, 48-49 

Family, 305 ff., 369, 373 
Feeble-mindedness, 378, 379, 380 
Financial rewards. See Income 
Finger Dexterity Test, 227 

First grade entry, 106, 391 

Form board, 168 

Formal discipline, 74 

Foster home, 306-12, 377, 404-7 


General factor, 4, 80, 115, 136, 213, 
410-11 

General intelligence, 38, Ch. III, 
133-34, 175, 197-98, 200-1, 210, 
222, 230-31, 232, 233, 235, 239; 
definitions of, 77 ff.; descriptions 
of, 80 ff. 

Genius, 297, 320-21, 378, 379, 382, 
384-85 

Gifted children, 347-48, 380-2, 384 

Global score, 90, 93, 128, 159, 207- 
8, 210-11, 213, 217, 241, 352; 
390-99 

Grade norms, 154-55 

Group norms, 402 

Group tests, 10 

Growth. See Linguistic develop- 
ment; Mental growth; Motor 
development 


Haggerty Intelligence Examination 
Delta, 151, 220, 297 

Haggerty-Olsen-Wickman Behavior 
Rating Schedules, 270 

Hand Tool Dexterity Test, 234 

Harlem, 332 


Height, 366 

Henmon-Nelson Test of Mental 
Ability, 34, 38, 152-55 

Heredity, 49, 79, 81, 246, 314, 328, 
332, 340, 371, 373 ff. 

Herring-Binet Scale, 118-20, 161 

Higher mental processes, 16, 23 

“Hollow” folk, 300-1, 303-4 

Home Inventory Scale, 310-11 

Humm-Wadsworth Temperament 
Scale, 49-50, 266-67 


Ideational Learning Test, 86 

Idiot, 380 

Illinois Intelligence Examination, 
318 

Imbecile, 380 

Income, 305, 376-77 

Index of brightness, 395 

Indians, American, 59-60, 174-75, 
328, 330, 332 

Individual differences, 382-83 

Individual tests, 9-10, 99 

Infant speech, 352 

Infant tests. See Young children, 
tests for 

Inhibition, 291 

I.E.R. Arithmetic Test, 325 

I.E.R. Assembly Test for Girls, 
233 

I.E.R. Intelligence Scale CAVD, 
35, 38, 47, 85, 90, 137-40, 202, 
217, 347, 381 

Institutional setting, 324 

Intellect CAVD, 139 

Intelligence, estimating, 90-92 

Intelligence quotient, 106-7, 108- 
11, 126, 127, 132, 133, 150-57, 
158, 165, 167, 184, 200, 216; 
changes in, 345, 347; classification 
of, 378-79; constancy of, 105-6, 
191-92, 317, 339-40, 341-47, 
394; early, 193, 346-47; gains, 
318-19, 323; high, in eminent 
persons, 384-85; meaningfulness 
of, 393-94; stability of, 108-9, 
133, 345, 392-93 

Intelligence tests, 41, 42, Ch. V, Ch. 
VI; appraisal of, 215 ff., 220-22; 
for high school and college, 
161 ff.; special purpose, 239, 240, 
244, 372; trends in, 215-17; vo- 
cational uses of, 208 

Interest, 272, 348-49; and success, 
273; measures of, 41, 272 ff.; per- 
manence of, 274 

Interest groups, 274-75 

Interest Questionnaire for High 
School Students, 277 

Interform coefficient, 52 

International Intelligence Test, 312 

Interrelations of tests, 40-41, 60- 
61, 69, 134-35, 144, 145, 154, 
155, 158, 159, 160, 161, 169-70, 
171, 171-72, 174, 183-84, 194, 
218-20, 230-37, 303, 391 

Interval Discrimination Test, 249- 
50 

Iowa Tests for Young Children, 
185-87 

Iowa University Placement Ex- 
amination, 6, 206, 240-41, 244, 
372, 403 


Kaffirs, 69 

Koerth Pursuit Test, 228 

Kohs Block Design Test, 170 

Kuder Preference Record, 280-83 

Kuhlmann-Anderson Test, 161, 219, 
221 


Kuhlmann-Binet Scale, 121-22, 221, 
297, 310, 345, 346 
Language, 329-30 
Language tests, 89 
Latin prognosis, 245-46 
Latin Prognosis Test, 226, 245, 254 
Law Aptitude Examination, 240 
Law Aptitude Test, 240 
Learning, 85-7, 92, 137, 349, 353 
Length of test, 46-47 
Level, 35. See also altitude 
“Likert” technique, 284 
Linguistic development, 351, 352, 
355 
Logical Decision Test, 270-71 


McAdory Art Test, 251 
McCall Multi-Mental Scale, 220 
Mannikin Test, 168 

Mare and Foal Test, 168 
Marks, 42, 221, 241-42 
Mathematics Aptitude Test, 241 
Maze Test, 170, 175, 357 
Mazes, 3 
Measurement, conditions of, 20 ff.; 
logic of, 33-34, 68-70, 259-60, 
335, 364-72, 374-76, 401-3 
Measures of Musical Talent, 43, 
55, 61, 226, 247-49, 399 
Mechanical aptitude, 226, 229; 
tests of, 41, 229 ff. 
Mechanistic psychology, 22-23, 25- 
26, 84 
Medical Aptitude Test, 19-20, 42, 
234-40, 244, 372 
Meier Art Test, 252-53 
Meier-Seashore Art Judgment Test, 
252 
Mens, 258-59; musical, 250 
Mental age, 61, 67-68, 103-6, 108- 
11, 131-32, 149-50, 156-57, 165, 
167, 171, 200, 327, 352, 383, 
390-91; of population, 149-50, 
107 
Mental decline, 358 
Mental growth, 104-5, 107, 108, 
123-25, 127, 128, 188, 194-95, 
350 ff., 394; adult, 354-59; curve 
of, 123, 125-26, 161, 359-63, 
391, 395; early, 351-53 
Mental maturity, 100 
Mental processes, 8-9 
Mental units, 125, 161 
Mentality, 294-98; waste of, 328 
Mercator Projection, 32-33, 364, 
368, 390, 391 
Merrill-Palmer Scale, 180-84, 194, 
393, 394 
Methods of work, 15 
Metropolitan Readiness Tests, 244 
Migration, 301-2, 332-33 
Miles Drill Test, 228 
Miller Mental Ability Test, 219 
Miner's Analysis of Work Interest, 
280 
Minnesota Home Index, 377 


Minnesota Interest Analysis Test, 
232 

Minnesota Mechanical Assembly 
Test, 232 

Minnesota Multiphasic Personality 
Inventory, 267-69, 271, 283, 399 

Minnesota Paper Form Board, 232, 
233, 332 

Minnesota Preschool Scale, 0 
80 

Minnesota Rate of Manipulation 
Test, 227 

Minnesota Spatial Relations Test, 
232, 233 

Minnesota Vocational Test for 
Clerical Workers, 226, 238-39 

Moral conduct, measures of, 289 ff. 

Moral knowledge, 289-90 

Moral opinion, 291 

Moron, 380 

Motivation, 5 

Motor ability, 228-29; measures of, 
226 ff. 

Motor Achievement Test, 187-88 

Motor age, 188 

Motor development, 188, 351 

Musical Memory Test, 250 

Musical persons, 259 


National Intelligence Tests, 4, 151- 
52, 153, 218, 220, 229, 300, 361 
Negativism, 194 
Negroes, 328-29, 330, 331, 332, 380 
Non-language tests, 9 
Non-profit organizations, 166-67 
Non-racial factors, 330-32 
Non-verbal tests, 9 
Normal distribution, 65, 91-92, 
364 ff., 368 
Normality, assumption of, 367-72 
Normalization, 368-69 
Northumberland, 297-98, 301-2 


Objectivity, 32, 59 ff. 
Observation, 35-36, 366 

Occupational Orientation Inquiry, 
279-80 

Occupations, 198-99, 283, 295-98 

Ohio State University Psychological 
Test, 165-66, 221 

Organization of mind, 195 

Originality, 15-16, 81 

O’Rourke Mechanical Aptitude 
Test, 65, 234-35 

Orphanage, 323-24 

Otis Group Intelligence Test, 155, 
219, 220, 221, 222 

Otis Quick-Scoring Mental Ability 
Tests, 158-59, 202, 388 

Otis Self-Administering Test of 
Mental Ability, 42, 65-66, 114, 
144, 155-58, 203, 217, 219, 220, 
221, 311, 312, 325, 332, 354 


Par placement, 186-87, 200 

Percent of average, 126, 395-96; 
stability of, 127, 128 

Percentile norms, 154 

Percentile scores, 65-66 

Percentiles, 396 

Performance tests, 9, 89, 167 ff., 
194, 300, 301, 302-3, 330, 355, 
391; values of, 171-72, 175 

Persistence, 201 

Personal constant. See Percent of 
average 

Personality, 258, 260, 262-63, 407; 
total, 384-85; types, 87-90, 258- 
59, 266, 267 

Personality quotient, 253 

Personality Quotient Test, 262-63 

Personality tests, 43, 260 ff., 406; 
evaluation of, 271-72 

Phrenology, 20, 73 

Pintner-Cunningham Primary Test, 
159, 219, 244, 300 

Pintner General Ability Tests, 244; 
non-language series, 172-74, 329; 
verbal series, 159 

Pintner Intelligence Test, 159 

Pintner Non-Language Group Test, 
330, 393, 394 

Pintner-Paterson Scale of Perform- 
ance Tests, 168-70, 219 

Pintner-Paterson Short Scale, 332 

Pintner Rapid Survey Test, 299 

Point Scale for the Measurement of 
Intelligence, 118 

Point Scale of Performance Tests, 
170-71 

Point scales, 89, 114, 115 ff.; values 
of, 120, 129 

Power, 147-49 

Practical problems, 15 

Practice effect, 322, 346 

Practice material, 49-50, 152, 165 

Prediction, 390 

Preschool, 315-24, 353 

Pressey Interest-Attitude Tests, 
288-89 

Pressey Primary Scale, 303 

Primary mental abilities, 281, 403, 
411-12. See also Factor analysis, 
Factors 

Probability, 366-67 

Probable error, 132 

Professional and academic aptitude 
tests, 239 ff. 
Profile scores, 93, 159, 165, 167, 
207-8, 211, 213-14, 217, 248, 
255, 261-62, 399-400 
Profiles, 44, 73, 383 
Progressive education, 23-24 
Projection, 416 
Projective tests, 7-8, 13, 23, 29, 
257, 415 ff. 
Psychoanalysis, 25 
Psychological theory, 22-26 
Psychometric tests, 7-8 
Psychotics, 88, 89, 116 


Quantity and quality, as aspect of 
psychological test scores, 383- 
85 


Race, 328 ff. 
Racial purity, 329 
Radiotelegraph Operators, Test for, 
404-5 
Range of difficulty, 47 
Range of intellect, 35, 82-83, 84, 
90, 137, 139 
Rapport, 50, 331 
Rational Learning Test, 85 
Raw scores, 62, 63 
Reality of traits, 36-37, 75-76 
Reasoning, 359 
Recall, immediate, 3 
Reconstruction, 81-82 
Reliability, 31, 45 ff.; coefficient of, 
51, 55-56. See also Interform 
coefficient, Retest coefficient, 
Split-half coefficient; degree of, 
57-59; recording of, 51-56 
Retardation, 352, 375 
Retest coefficient, 52 
Retroactive inhibition, 25 
Revised Stanford-Binet Scale, ad- 
ministration of, 106; criticisms 
of, 108-18; reliability of, 110-11; 
scaling of, 103-6; scoring of, 
106-7; standardization of, 103-6. 
See also Stanford-Binet revisions 
Rogers Interpolation Test, 245 
Rorschach Test, 418-21 
Sample, 64-65 
Scalability, 285 
Scales, 96 
Scaling, 10, 47 
Scatter pattern, 89 
School achievement. See Educa- 
tional achievement 
School continuance, 315 
School environment, 197-98 
Schooling, 150, 337, 373; effects of 
later, 325-27; and mentality, 
315 ff. 
Science Research Associates (Test) 
of Primary Mental Abilities, 212- 


Scores, significance of, 363-64, 
378 ff. 

Scores, social meaning of, 380- 
83 

Scores, stability and meaningful- 
ness of, 380 ff. 

Scores, types of. See Global scores, 
Percentile scores, Profile scores, 
Raw scores, Standard scores 

Scoring devices, 154, 158, 164, 215- 
16 

Scrambled organization, 144, 154, 
156, 215 

Screening tests, 206, 216, 405-6 

Singing, 352 

Seashore Measures of Musical 
Talent. See Measures of Musical 
Talent 

Seashore Motor Rhythm Test, 228, 
229 

Seguin Form Board, 168 

Selection, 301-2, 315 

Set, mental, 49 

Sex differences, 386 

Single-item tests, 216 

Skin color, 329 

Social significance, 81, 82 

Socio-economic factors, 294 ff., 330- 
31 

Socio-economic Scales, 376 

Socio-economic status, 373 

Spearman-Brown Prophecy For- 
mula, 46-47, 53 

Special abilities, 235-36 

Speed, 35, 81, 83, 84, 137, 147-49, 
169, 201, 211, 331-32, 357 

Spiral-omnibus, See Scrambled or- 
ganization 

Split-half coefficient, 53 

Stability, 389-90. See also Intelli- 
gence quotient, stability of; Per- 
cent of average, stability of 

Standard deviation, 66 

Standard deviation scores, 396 

Standard error, 54 

Standard scores, 66-67, 132 

Standardization, 33, 61 ff., 197, 216- 
17, 219-20, 312, 335, 364-65, 
404 ff. 

Standardization group, 64-65, 67 

Stanford Achievement Test, 221, 
32 

Stanford-Binet profiles, 88-89, 
106 

Stanford-Binet revisions, 17-18, 59, 
67, 85-86, 88, 90, 97-115, 122, 
123, 125, 126, 135-36, 145, 149- 
50, 161, 169, 171, 174, 175, 183, 
191-92, 194, 200, 202, 218, 219, 
220, 221, 296, 300, 302, 310, 312, 
321, 328, 330, 332, 335, 341-45, 
346, 347, 353, 355, 362-63, 377, 
392, 394, 402, 413. See also Re- 
vised Stanford-Binet Scale, Stan- 
ford Revision of Binet Scale 

Stanford Later Maturity Study, 
357-58 

Stanford Motor Skills Test, 227- 
28 

Stanford Revision of Binet Scale, 
standardization groups, 100-2; 
standardization of, 116; valida- 
tion of, 102-3. See also Stanford- 
Binet revisions 

Stanford Scientific Aptitude Test, 
253-54 

Statistical measures, 396-98 

Stenquist Assembly Test, 230-31, 
232 

Slut Measures of Mechanical 
Aptitude, 42, 234 

Study of Values, 286-87 

Subject, 48-50 

Subtests, 2; intercorrelation of, 147 

Success, vocational, 43, 44 

Summated ratings, 284 

Talent, 225-26, 246-47, 253 
Talent tests, 47, 246 ff. 
Teacher’s ratings, 42 


Temperament, 257, 258 

Terman Group Test of Mental 
Ability, 42, 148, 161-62, 163, 
219, 220, 221, 388 

Terman-McNemar Test of Mental 
Ability, 162-63 

Test instructions, 50 

Test items, 2, 9, 10-14, 34, 37-39, 
48, 201, 217; independence of, 
47-48 

Test of Mechanical Comprehension, 
236 

Test of Public Opinion, 288 

Tests, early, 73; emerging types of, 
208 ff.; improvement of, Ch. XI, 
363-64; limitations of, 10 ff., 14- 
16; nature of, 1-2; types of, 7 ff.; 
values of, 10 ff., 16-22, 24, 45, 
150, 217-18 

Tests of Fundamental Abilities of 
Visual Arts, 250-51 

Tests of Mental Development, 10, 
122-28 

Thematic Apperception Test, 417 

Thorndike Intelligence Examination 
for High School Graduates, 19, 
163, 202, 228, 235 

Thorndike-McCall Reading Scale, 
325 

Thorndike Test of Word Knowl- 
edge, 245 

“Thurstone” technique, 283, 285 

Time between testings, 345-46 

Trabue Language Completion Scale, 
145 

Trait, 225-26, 257, 265 

Truck drivers, 405 

True-false tests, 60 

Trustworthiness, 35, 38 

Tweezer Dexterity Test, 227 

Twins, 312-14, 377 


U. S. Armed Forces Institute Tests 
of General Educational Develop- 
ment, 5-6, 205-6 
Units of measurement, 388-89 
Universal test, 69, 402 

University of Iowa, 319 
University of Minnesota, 242-43 
Unreliability, causes of, 46-51 
Urban-rural differences, 298, 301, 
302-3 


Validity, 30, 34 ff., 44-45, 166, 217, 
228-29, 248-49, 254, 388-89, 
407; establishment of, 39-45; 
factorial, 36, 39; practical, 39 

Value, 258 

Variability, 123 

Variable error, 127 

Variance, 53 

Verbal tests, 302-3, 330, 391; cf. 
performance tests, 219-20 

Vocabulary test, 38, 59, 111-13, 
193, 355-57 

Vocational aptitudes, tests for, 
236 ff. 

Vocational Interest Blank for Men, 
43, 277-79, 280, 281, 413 

Vocational Interest Blank for 
Women, 279 


Waste of mentality, 328 

Wechsler-Bellevue Scale, 89, 90, 93, 
107, 129-36, 137, 140, 178, 202- 
3, 206, 216, 222, 400; form B, 
131, 135-36 

Weight, 366 

Whole-part learning, 86 

Winnetka, Ill., 319, 322 

Wonderlic Personnel Test, 203, 206, 
216 

Word Association Test, 417 

Work sample tests, 226, 245 

World War I, 142-51, 197, 295, 
353 

World War II, 20, 43-45, 205-6, 
216, 295, 404-7 


Young children, tests for, 178 ff., 
190-96, 321 


Zero intelligence, 139, 362 